-
RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
Authors:
Le Vu Anh,
Nguyen Viet Anh,
Mehmet Dik,
Luong Van Nghia
Abstract:
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring…
▽ More
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions. A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization. This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass. We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates. On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918. This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU. RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Hybrid Meta-learners for Estimating Heterogeneous Treatment Effects
Authors:
Zhongyuan Liang,
Lars van der Laan,
Ahmed Alaa
Abstract:
Estimating conditional average treatment effects (CATE) from observational data involves modeling decisions that differ from supervised learning, particularly concerning how to regularize model complexity. Previous approaches can be grouped into two primary "meta-learner" paradigms that impose distinct inductive biases. Indirect meta-learners first fit and regularize separate potential outcome (PO…
▽ More
Estimating conditional average treatment effects (CATE) from observational data involves modeling decisions that differ from supervised learning, particularly concerning how to regularize model complexity. Previous approaches can be grouped into two primary "meta-learner" paradigms that impose distinct inductive biases. Indirect meta-learners first fit and regularize separate potential outcome (PO) models and then estimate CATE by taking their difference, whereas direct meta-learners construct and directly regularize estimators for the CATE function itself. Neither approach consistently outperforms the other across all scenarios: indirect learners perform well when the PO functions are simple, while direct learners outperform when the CATE is simpler than individual PO functions. In this paper, we introduce the Hybrid Learner (H-learner), a novel regularization strategy that interpolates between the direct and indirect regularizations depending on the dataset at hand. The H-learner achieves this by learning intermediate functions whose difference closely approximates the CATE without necessarily requiring accurate individual approximations of the POs themselves. We demonstrate empirically that intentionally allowing suboptimal fits to the POs improves the bias-variance tradeoff in estimating CATE. Experiments conducted on semi-synthetic and real-world benchmark datasets illustrate that the H-learner consistently operates at the Pareto frontier, effectively combining the strengths of both direct and indirect meta-learners.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction
Authors:
Ruben Weitzman,
Peter Mørch Groth,
Lood Van Niekerk,
Aoi Otani,
Yarin Gal,
Debora Marks,
Pascal Notin
Abstract:
Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrie…
▽ More
Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertions & deletions patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time -- offering a scalable alternative to alignment-centric approaches.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting
Authors:
Mengjiao Ma,
Qi Ma,
Yue Li,
Jiahuan Cheng,
Runyi Yang,
Bin Ren,
Nikola Popovic,
Mingqiang Wei,
Nicu Sebe,
Luc Van Gool,
Theo Gevers,
Martin R. Oswald,
Danda Pani Paudel
Abstract:
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting line of work fall into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) general…
▽ More
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting line of work fall into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approach. However, most of them are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting ability and insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate the generalizable approach could harness strong data priors. Our codes, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection
Authors:
Duc Thanh Pham,
Hong Dang Nguyen,
Nhat Minh Nguyen Quoc,
Linh Ngo Van,
Sang Dinh Viet,
Duc Anh Nguyen
Abstract:
Recently, object detection models have witnessed notable performance improvements, particularly with transformer-based models. However, new objects frequently appear in the real world, requiring detection models to continually learn without suffering from catastrophic forgetting. Although Incremental Object Detection (IOD) has emerged to address this challenge, these existing models are still not…
▽ More
Recently, object detection models have witnessed notable performance improvements, particularly with transformer-based models. However, new objects frequently appear in the real world, requiring detection models to continually learn without suffering from catastrophic forgetting. Although Incremental Object Detection (IOD) has emerged to address this challenge, these existing models are still not practical due to their limited performance and prolonged inference time. In this paper, we introduce a novel framework for IOD, called Hier-DETR: Hierarchical Neural Collapse Detection Transformer, ensuring both efficiency and competitive performance by leveraging Neural Collapse for imbalance dataset and Hierarchical relation of classes' labels.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Improving Wildlife Out-of-Distribution Detection: Africas Big Five
Authors:
Mufhumudzi Muthivhi,
Jiahao Huo,
Fredrik Gustafsson,
Terence L. van Zyl
Abstract:
Mitigating human-wildlife conflict seeks to resolve unwanted encounters between these parties. Computer Vision provides a solution to identifying individuals that might escalate into conflict, such as members of the Big Five African animals. However, environments often contain several varied species. The current state-of-the-art animal classification models are trained under a closed-world assumpt…
▽ More
Mitigating human-wildlife conflict seeks to resolve unwanted encounters between these parties. Computer Vision provides a solution to identifying individuals that might escalate into conflict, such as members of the Big Five African animals. However, environments often contain several varied species. The current state-of-the-art animal classification models are trained under a closed-world assumption. They almost always remain overconfident in their predictions even when presented with unknown classes. This study investigates out-of-distribution (OOD) detection of wildlife, specifically the Big Five. To this end, we select a parametric Nearest Class Mean (NCM) and a non-parametric contrastive learning approach as baselines to take advantage of pretrained and projected features from popular classification encoders. Moreover, we compare our baselines to various common OOD methods in the literature. The results show feature-based methods reflect stronger generalisation capability across varying classification thresholds. Specifically, NCM with ImageNet pre-trained features achieves a 2%, 4% and 22% improvement on AUPR-IN, AUPR-OUT and AUTC over the best OOD methods, respectively. The code can be found here https://github.com/pxpana/BIG5OOD
△ Less
Submitted 7 June, 2025;
originally announced June 2025.
-
Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
Authors:
Yu Li,
Xingyu Qiu,
Yuqian Fu,
Jie Chen,
Tianwen Qian,
Xu Zheng,
Danda Pani Paudel,
Yanwei Fu,
Xuanjing Huang,
Luc Van Gool,
Yu-Gang Jiang
Abstract:
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation…
▽ More
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
Improved Allergy Wheal Detection for the Skin Prick Automated Test Device
Authors:
Rembert Daems,
Sven Seys,
Valérie Hox,
Adam Chaker,
Glynnis De Greve,
Winde Lemmens,
Anne-Lise Poirrier,
Eline Beckers,
Zuzana Diamant,
Carmen Dierickx,
Peter W. Hellings,
Caroline Huart,
Claudia Jerin,
Mark Jorissen,
Hanne Oscé,
Karolien Roux,
Mark Thompson,
Sophie Tombu,
Saartje Uyttebroek,
Andrzej Zarowski,
Senne Gorris,
Laura Van Gerven,
Dirk Loeckx,
Thomas Demeester
Abstract:
Background: The skin prick test (SPT) is the gold standard for diagnosing sensitization to inhalant allergies. The Skin Prick Automated Test (SPAT) device was designed for increased consistency in test results, and captures 32 images to be jointly used for allergy wheal detection and delineation, which leads to a diagnosis.
Materials and Methods: Using SPAT data from $868$ patients with suspecte…
▽ More
Background: The skin prick test (SPT) is the gold standard for diagnosing sensitization to inhalant allergies. The Skin Prick Automated Test (SPAT) device was designed for increased consistency in test results, and captures 32 images to be jointly used for allergy wheal detection and delineation, which leads to a diagnosis.
Materials and Methods: Using SPAT data from $868$ patients with suspected inhalant allergies, we designed an automated method to detect and delineate wheals on these images. To this end, $10,416$ wheals were manually annotated by drawing detailed polygons along the edges. The unique data-modality of the SPAT device, with $32$ images taken under distinct lighting conditions, requires a custom-made approach. Our proposed method consists of two parts: a neural network component that segments the wheals on the pixel level, followed by an algorithmic and interpretable approach for detecting and delineating the wheals.
Results: We evaluate the performance of our method on a hold-out validation set of $217$ patients. As a baseline we use a single conventionally lighted image per SPT as input to our method.
Conclusion: Using the $32$ SPAT images under various lighting conditions offers a considerably higher accuracy than a single image in conventional, uniform light.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025
Authors:
Yuqian Fu,
Runze Wang,
Yanwei Fu,
Danda Pani Paudel,
Luc Van Gool
Abstract:
In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhanc…
▽ More
In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross-view object alignment module that enforces object-level consistency across perspectives, thereby improving the model's robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark. Code will be made available at https://github.com/lovelyqian/ObjectRelator.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations
Authors:
Swagat Kumar,
Jan-Nico Zaech,
Colin Michael Wilmott,
Luc Van Gool
Abstract:
Variational Quantum Algorithms (VQAs) are a promising approach for leveraging powerful Noisy Intermediate-Scale Quantum (NISQ) computers. When applied to machine learning tasks, VQAs give rise to NISQ-compatible Quantum Neural Networks (QNNs), which have been shown to outperform classical neural networks with a similar number of trainable parameters. While the quantum circuit structures of VQAs fo…
▽ More
Variational Quantum Algorithms (VQAs) are a promising approach for leveraging powerful Noisy Intermediate-Scale Quantum (NISQ) computers. When applied to machine learning tasks, VQAs give rise to NISQ-compatible Quantum Neural Networks (QNNs), which have been shown to outperform classical neural networks with a similar number of trainable parameters. While the quantum circuit structures of VQAs for physics simulations are determined by the physical properties of the systems, identifying effective QNN architectures for general machine learning tasks is a difficult challenge due to the lack of domain-specific priors. Indeed, existing Quantum Architecture Search (QAS) algorithms, adaptations of classical neural architecture search techniques, often overlook the inherent quantum nature of the circuits they produce. By approaching QAS from the ground-up and from a quantum perspective, we resolve this limitation by proposing $ρ$DARTS, a differentiable QAS algorithm that models the search process as the evolution of a quantum mixed state, emerging from the search space of quantum architectures. We validate our method by finding circuits for state initialization, Hamiltonian optimization, and image classification. Further, we demonstrate better convergence against existing QAS techniques and show improved robustness levels to noise.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation
Authors:
Jialei Chen,
Xu Zheng,
Danda Pani Paudel,
Luc Van Gool,
Hiroshi Murase,
Daisuke Deguchi
Abstract:
Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a m…
▽ More
Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modalities, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over the prior arts.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models
Authors:
Yan Shu,
Bin Ren,
Zhitong Xiong,
Danda Pani Paudel,
Luc Van Gool,
Begum Demir,
Nicu Sebe,
Paolo Rota
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data unders…
▽ More
Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Hybrid Cross-domain Robust Reinforcement Learning
Authors:
Linh Le Pham Van,
Minh Hoang Nguyen,
Hung Le,
Hung The Tran,
Sunil Gupta
Abstract:
Robust reinforcement learning (RL) aims to learn policies that remain effective despite uncertainties in its environment, which frequently arise in real-world applications due to variations in environment dynamics. The robust RL methods learn a robust policy by maximizing value under the worst-case models within a predefined uncertainty set. Offline robust RL algorithms are particularly promising…
▽ More
Robust reinforcement learning (RL) aims to learn policies that remain effective despite uncertainties in its environment, which frequently arise in real-world applications due to variations in environment dynamics. The robust RL methods learn a robust policy by maximizing value under the worst-case models within a predefined uncertainty set. Offline robust RL algorithms are particularly promising in scenarios where only a fixed dataset is available and new data cannot be collected. However, these approaches often require extensive offline data, and gathering such datasets for specific tasks in specific environments can be both costly and time-consuming. Using an imperfect simulator offers a faster, cheaper, and safer way to collect data for training, but it can suffer from dynamics mismatch. In this paper, we introduce HYDRO, the first Hybrid Cross-Domain Robust RL framework designed to address these challenges. HYDRO utilizes an online simulator to complement the limited amount of offline datasets in the non-trivial context of robust RL. By measuring and minimizing performance gaps between the simulator and the worst-case models in the uncertainty set, HYDRO employs novel uncertainty filtering and prioritized sampling to select the most relevant and reliable simulator samples. Our extensive experiments demonstrate HYDRO's superior performance over existing methods across various tasks, underscoring its potential to improve sample efficiency in offline robust RL.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
Authors:
Nedko Savov,
Naser Kazemi,
Deheng Zhang,
Danda Pani Paudel,
Xi Wang,
Luc Van Gool
Abstract:
World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on a short sequence of observations causes them to quickly lose track of context. As a result, visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of the state-of-the-…
▽ More
World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on a short sequence of observations causes them to quickly lose track of context. As a result, visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of the state-of-the-art diffusion-based world models comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform on long-context tasks by integrating a sequence representation from a state-space model (Mamba), representing the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Manifold-aware Representation Learning for Degradation-agnostic Image Restoration
Authors:
Bin Ren,
Yawei Li,
Xu Zheng,
Yuqian Fu,
Danda Pani Paudel,
Ming-Hsuan Yang,
Luc Van Gool,
Nicu Sebe
Abstract:
Image Restoration (IR) aims to recover high quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a un…
▽ More
Image Restoration (IR) aims to recover high quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a unified and lightweight framework for all in one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches, each processed by a specialized module attention for global context, convolution for local textures, and MLP for channel-wise statistics. This modular decomposition significantly improves generalization and efficiency across diverse degradations. Furthermore, we introduce a cross layer contrastive learning scheme that aligns shallow and latent features to enhance the discriminability of shared representations. To better capture the underlying geometry of feature representations, we perform contrastive learning in a Symmetric Positive Definite (SPD) manifold space rather than the conventional Euclidean space. Extensive experiments show that MIRAGE not only achieves new state of the art performance across a variety of degradation types but also offers a scalable solution for challenging all-in-one IR scenarios. Our code and models will be publicly available at https://amazingren.github.io/MIRAGE/.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
MLLMs are Deeply Affected by Modality Bias
Authors:
Xu Zheng,
Chenfei Liao,
Yuqian Fu,
Kaiyu Lei,
Yuanhuiyi Lyu,
Lutao Jiang,
Bin Ren,
Jialei Chen,
Jiawen Wang,
Chengxin Li,
Linfeng Zhang,
Danda Pani Paudel,
Xuanjing Huang,
Yu-Gang Jiang,
Nicu Sebe,
Dacheng Tao,
Luc Van Gool,
Xuming Hu
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of m…
▽ More
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of modality bias, highlighting its manifestations across various tasks. Secondly, we propose a systematic research road-map related to modality bias in MLLMs. Thirdly, we identify key factors of modality bias in MLLMs and offer actionable suggestions for future research to mitigate it. To substantiate these findings, we conduct experiments that demonstrate the influence of each factor: 1. Data Characteristics: Language data is compact and abstract, while visual data is redundant and complex, creating an inherent imbalance in learning dynamics. 2. Imbalanced Backbone Capabilities: The dominance of pretrained language models in MLLMs leads to overreliance on language and neglect of visual information. 3. Training Objectives: Current objectives often fail to promote balanced cross-modal alignment, resulting in shortcut learning biased toward language. These findings highlight the need for balanced training strategies and model architectures to better integrate multiple modalities in MLLMs. We call for interdisciplinary efforts to tackle these challenges and drive innovation in MLLM research. Our work provides a fresh perspective on modality bias in MLLMs and offers insights for developing more robust and generalizable multimodal systems-advancing progress toward Artificial General Intelligence.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Minimizing the energy depletion in wireless rechargeable sensor networks using bi-level metaheuristic charging schemes
Authors:
Huynh Thi Thanh Binh,
Le Van Cuong,
Dang Hai Dang,
Le Trong Vinh
Abstract:
Recently, Wireless Rechargeable Sensor Networks (WRSNs) that leveraged the advantage of wireless energy transfer technology have opened a promising opportunity in solving the limited energy issue. However, an ineffective charging strategy may reduce the charging performance. Although many practical charging algorithms have been introduced, these studies mainly focus on optimizing the charging path…
▽ More
Recently, Wireless Rechargeable Sensor Networks (WRSNs) that leveraged the advantage of wireless energy transfer technology have opened a promising opportunity in solving the limited energy issue. However, an ineffective charging strategy may reduce the charging performance. Although many practical charging algorithms have been introduced, these studies mainly focus on optimizing the charging path with a fully charging approach. This approach may lead to the death of a series of sensors due to their extended charging latency. This paper introduces a novel partial charging approach that follows a bi-level optimized scheme to minimize energy depletion in WRSNs. We aim at optimizing simultaneously two factors: the charging path and time. To accomplish this, we first formulate a mathematical model of the investigated problem. We then propose two approximate algorithms in which the optimization of the charging path and the charging time are considered as the upper and lower level, respectively. The first algorithm combines a Multi-start Local Search method and a Genetic Algorithm to find a solution. The second algorithm adopts a nested approach that utilizes the advantages of the Multitasking and Covariance Matrix Adaptation Evolutionary Strategies. Experimental validations on various network scenarios demonstrate that our proposed algorithms outperform the existing works.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
Authors:
Ruilin Yao,
Bo Zhang,
Jirui Huang,
Xinwei Long,
Yifang Zhang,
Tianyu Zou,
Yufei Wu,
Shichao Su,
Yifan Xu,
Wenxi Zeng,
Zhaoyu Yang,
Guoyou Li,
Shilan Zhang,
Zichan Li,
Yaxiong Chen,
Shengwu Xiong,
Peng Xu,
Jiajun Zhang,
Bowen Zhou,
David Clifton,
Luc Van Gool
Abstract:
Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in the task-oriented manner without guarantee that different task samples come from the same data distribution, thus they often fall short in…
▽ More
Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in the task-oriented manner without guarantee that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manully collected from the social media, in which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models are released later than Dec. 2024, and none of them achieve an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
NOMAD Projection
Authors:
Brandon Duderstadt,
Zach Nussbaum,
Laurens van der Maaten
Abstract:
The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this p…
▽ More
The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
Creative Preference Optimization
Authors:
Mete Ismayilzada,
Antonio Laverghetta Jr.,
Simone A. Luchini,
Reet Patel,
Antoine Bosselut,
Lonneke van der Plas,
Roger Beaty
Abstract:
While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content-characterized by novelty, diversity, surprise, and quality-remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a g…
▽ More
While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content-characterized by novelty, diversity, surprise, and quality-remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting
Authors:
Bao-Ngoc Dao,
Quang Nguyen,
Luyen Ngo Dinh,
Minh Le,
Nam Le,
Linh Ngo Van
Abstract:
Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE,…
▽ More
Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
RoFL: Robust Fingerprinting of Language Models
Authors:
Yun-Yun Tsai,
Chuan Guo,
Junfeng Yang,
Laurens van der Maaten
Abstract:
AI developers are releasing large language models (LLMs) under a variety of different licenses. Many of these licenses restrict the ways in which the models or their outputs may be used. This raises the question how license violations may be recognized. In particular, how can we identify that an API or product uses (an adapted version of) a particular LLM? We present a new method that enable model…
▽ More
AI developers are releasing large language models (LLMs) under a variety of different licenses. Many of these licenses restrict the ways in which the models or their outputs may be used. This raises the question how license violations may be recognized. In particular, how can we identify that an API or product uses (an adapted version of) a particular LLM? We present a new method that enable model developers to perform such identification via fingerprints: statistical patterns that are unique to the developer's model and robust to common alterations of that model. Our method permits model identification in a black-box setting using a limited number of queries, enabling identification of models that can only be accessed via an API or product. The fingerprints are non-invasive: our method does not require any changes to the model during training, hence by design, it does not impact model quality. Empirically, we find our method provides a high degree of robustness to common changes in the model or inference settings. In our experiments, it substantially outperforms prior art, including invasive methods that explicitly train watermarks into the model.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Authors:
Zihao Dongfang,
Xu Zheng,
Ziqiao Weng,
Yuanhuiyi Lyu,
Danda Pani Paudel,
Luc Van Gool,
Kailun Yang,
Xuming Hu
Abstract:
The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this pape…
▽ More
The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
Camera-Only 3D Panoptic Scene Completion for Autonomous Driving through Differentiable Object Shapes
Authors:
Nicola Marinello,
Simen Cassiman,
Jonas Heylen,
Marc Proesmans,
Luc Van Gool
Abstract:
Autonomous vehicles need a complete map of their surroundings to plan and act. This has sparked research into the tasks of 3D occupancy prediction, 3D scene completion, and 3D panoptic scene completion, which predict a dense map of the ego vehicle's surroundings as a voxel grid. Scene completion extends occupancy prediction by predicting occluded regions of the voxel grid, and panoptic scene compl…
▽ More
Autonomous vehicles need a complete map of their surroundings to plan and act. This has sparked research into the tasks of 3D occupancy prediction, 3D scene completion, and 3D panoptic scene completion, which predict a dense map of the ego vehicle's surroundings as a voxel grid. Scene completion extends occupancy prediction by predicting occluded regions of the voxel grid, and panoptic scene completion further extends this task by also distinguishing object instances within the same class; both aspects are crucial for path planning and decision-making. However, 3D panoptic scene completion is currently underexplored. This work introduces a novel framework for 3D panoptic scene completion that extends existing 3D semantic scene completion models. We propose an Object Module and Panoptic Module that can easily be integrated with 3D occupancy and scene completion methods presented in the literature. Our approach leverages the available annotations in occupancy benchmarks, allowing individual object shapes to be learned as a differentiable problem. The code is available at https://github.com/nicolamarinello/OffsetOcc .
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Predicting butterfly species presence from satellite imagery using soft contrastive regularisation
Authors:
Thijs L van der Plas,
Stephen Law,
Michael JO Pocock
Abstract:
The growing demand for scalable biodiversity monitoring methods has fuelled interest in remote sensing data, due to its widespread availability and extensive coverage. Traditionally, the application of remote sensing to biodiversity research has focused on mapping and monitoring habitats, but with increasing availability of large-scale citizen-science wildlife observation data, recent methods have…
▽ More
The growing demand for scalable biodiversity monitoring methods has fuelled interest in remote sensing data, due to its widespread availability and extensive coverage. Traditionally, the application of remote sensing to biodiversity research has focused on mapping and monitoring habitats, but with increasing availability of large-scale citizen-science wildlife observation data, recent methods have started to explore predicting multi-species presence directly from satellite images. This paper presents a new data set for predicting butterfly species presence from satellite data in the United Kingdom. We experimentally optimise a Resnet-based model to predict multi-species presence from 4-band satellite images, and find that this model especially outperforms the mean rate baseline for locations with high species biodiversity. To improve performance, we develop a soft, supervised contrastive regularisation loss that is tailored to probabilistic labels (such as species-presence data), and demonstrate that this improves prediction accuracy. In summary, our new data set and contrastive regularisation method contribute to the open challenge of accurately predicting species biodiversity from remote sensing data, which is key for efficient biodiversity monitoring.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer
Authors:
Minh Hoang Nguyen,
Linh Le Pham Van,
Thommen George Karimpanal,
Sunil Gupta,
Hung Le
Abstract:
Decision Transformers (DT) play a crucial role in modern reinforcement learning, leveraging offline datasets to achieve impressive results across various domains. However, DT requires high-quality, comprehensive data to perform optimally. In real-world applications, the lack of training data and the scarcity of optimal behaviours make training on offline datasets challenging, as suboptimal data ca…
▽ More
Decision Transformers (DT) play a crucial role in modern reinforcement learning, leveraging offline datasets to achieve impressive results across various domains. However, DT requires high-quality, comprehensive data to perform optimally. In real-world applications, the lack of training data and the scarcity of optimal behaviours make training on offline datasets challenging, as suboptimal data can hinder performance. To address this, we propose the Counterfactual Reasoning Decision Transformer (CRDT), a novel framework inspired by counterfactual reasoning. CRDT enhances DT ability to reason beyond known data by generating and utilizing counterfactual experiences, enabling improved decision-making in unseen scenarios. Experiments across Atari and D4RL benchmarks, including scenarios with limited data and altered dynamics, demonstrate that CRDT outperforms conventional DT approaches. Additionally, reasoning counterfactually allows the DT agent to obtain stitching abilities, combining suboptimal trajectories, without architectural modifications. These results highlight the potential of counterfactual reasoning to enhance reinforcement learning agents' performance and generalization capabilities.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization
Authors:
Xu Zheng,
Yuanhuiyi Lyu,
Lutao Jiang,
Danda Pani Paudel,
Luc Van Gool,
Xuming Hu
Abstract:
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-wo…
▽ More
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our proposed plug-and-play term on high-level features and also segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Split Matching for Inductive Zero-shot Semantic Segmentation
Authors:
Jialei Chen,
Xu Zheng,
Dongyue Li,
Chong Yi,
Seigo Ito,
Danda Pani Paudel,
Luc Van Gool,
Hiroshi Murase,
Daisuke Deguchi
Abstract:
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great latent in ZSS, as it enables objec…
▽ More
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great latent in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the setting of ZSS. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
One2Any: One-Reference 6D Pose Estimation for Any Object
Authors:
Mengya Liu,
Siyuan Li,
Ajad Chhatkuli,
Prune Truong,
Luc Van Gool,
Federico Tombari
Abstract:
6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make generalization to novel objects difficult for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method One2Any that estimates the relative 6-degr…
▽ More
6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make generalization to novel objects difficult for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method One2Any that estimates the relative 6-degrees of freedom (DOF) object pose using only a single reference-single query RGB-D image, without prior knowledge of its 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process, first, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes an object shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinate (ROC) for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pair-wise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness even rivaling methods that require multi-view or CAD inputs, at a fraction of compute.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
SubGrapher: Visual Fingerprinting of Chemical Structures
Authors:
Lucas Morin,
Gerhard Ingmar Meijer,
Valéry Weber,
Luc Van Gool,
Peter W. J. Staar
Abstract:
Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerpr…
▽ More
Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code will be made publicly available.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
A Statistical Approach for Synthetic EEG Data Generation
Authors:
Gideon Vos,
Maryam Ebrahimpour,
Liza van Eijk,
Zoltan Sarnyai,
Mostafa Rahimi Azghadi
Abstract:
Electroencephalogram (EEG) data is crucial for diagnosing mental health conditions but is costly and time-consuming to collect at scale. Synthetic data generation offers a promising solution to augment datasets for machine learning applications. However, generating high-quality synthetic EEG that preserves emotional and mental health signals remains challenging. This study proposes a method combin…
▽ More
Electroencephalogram (EEG) data is crucial for diagnosing mental health conditions but is costly and time-consuming to collect at scale. Synthetic data generation offers a promising solution to augment datasets for machine learning applications. However, generating high-quality synthetic EEG that preserves emotional and mental health signals remains challenging. This study proposes a method combining correlation analysis and random sampling to generate realistic synthetic EEG data.
We first analyze interdependencies between EEG frequency bands using correlation analysis. Guided by this structure, we generate synthetic samples via random sampling. Samples with high correlation to real data are retained and evaluated through distribution analysis and classification tasks. A Random Forest model trained to distinguish synthetic from real EEG performs at chance level, indicating high fidelity.
The generated synthetic data closely match the statistical and structural properties of the original EEG, with similar correlation coefficients and no significant differences in PERMANOVA tests. This method provides a scalable, privacy-preserving approach for augmenting EEG datasets, enabling more efficient model training in mental health research.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation
Authors:
Bin Ren,
Eduard Zamfir,
Zongwei Wu,
Yawei Li,
Yidi Li,
Danda Pani Paudel,
Radu Timofte,
Ming-Hsuan Yang,
Luc Van Gool,
Nicu Sebe
Abstract:
Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing mo…
▽ More
Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models.Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82\% in parameters and 85\% in FLOPs. Our code will be available at our Project page (https://amazingren.github.io/AnyIR/)
△ Less
Submitted 19 April, 2025;
originally announced April 2025.
-
Automated Taxi Booking Operations for Autonomous Vehicles
Authors:
Linh Van Ma,
Shoaib Azam,
Farzeen Munir,
Moongu Jeon
Abstract:
In a conventional taxi booking system, all taxi operations are mostly done by a decision made by drivers which is hard to implement in unmanned vehicles. To address this challenge, we introduce a taxi booking system which assists autonomous vehicles to pick up customers. The system can allocate an autonomous vehicle (AV) as well as plan service trips for a customer request. We use our own AV to se…
▽ More
In a conventional taxi booking system, all taxi operations are mostly done by a decision made by drivers which is hard to implement in unmanned vehicles. To address this challenge, we introduce a taxi booking system which assists autonomous vehicles to pick up customers. The system can allocate an autonomous vehicle (AV) as well as plan service trips for a customer request. We use our own AV to serve a customer who uses a mobile application to make his taxi request. Apart from customer and AV, we build a server to monitor customers and AVs. It also supports inter-communication between a customer and an AV once AV decided to pick up a customer.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results
Authors:
Lei Sun,
Andrea Alfarano,
Peiqi Duan,
Shaolin Su,
Kaiwei Wang,
Boxin Shi,
Radu Timofte,
Danda Pani Paudel,
Luc Van Gool,
Qinglin Liu,
Wei Yu,
Xiaoqian Lv,
Lu Yang,
Shuigen Wang,
Shengping Zhang,
Xiangyang Ji,
Long Bao,
Yuqiang Yang,
Jinao Song,
Ziyi Wang,
Shuang Wen,
Heng Sun,
Kean Liu,
Mingchen Zhong,
Senyan Xu
, et al. (63 additional authors not shown)
Abstract:
This paper presents an overview of NTIRE 2025 the First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on com…
▽ More
This paper presents an overview of NTIRE 2025 the First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on computational complexity or model size. The task focuses on leveraging both events and images as inputs for single-image deblurring. A total of 199 participants registered, among whom 15 teams successfully submitted valid results, offering valuable insights into the current state of event-based image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Image Denoising Challenge Report
Authors:
Lei Sun,
Hang Guo,
Bin Ren,
Luc Van Gool,
Radu Timofte,
Yawei Li,
Xiangyu Kong,
Hyunhee Park,
Xiaoxuan Yu,
Suejin Han,
Hakjae Jeon,
Jia Li,
Hyung-Ju Chun,
Donghun Ryou,
Inju Ha,
Bohyung Han,
Jingyu Ma,
Zhijuan Huang,
Huiyuan Fu,
Hongyuan Yu,
Boqi Zhang,
Jiawei Shi,
Heng Zhang,
Huadong Ma,
Deepak Kumar Tyagi
, et al. (69 additional authors not shown)
Abstract:
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent ad…
▽ More
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results
Authors:
Yuqian Fu,
Xingyu Qiu,
Bin Ren,
Yanwei Fu,
Radu Timofte,
Nicu Sebe,
Ming-Hsuan Yang,
Luc Van Gool,
Kaijin Zhang,
Qingpeng Nong,
Xiugang Dong,
Hong Gao,
Xiangsheng Zhou,
Jiancheng Pan,
Yanxing Liu,
Xiao He,
Jiahao Li,
Yuze Sun,
Xiaomeng Huang,
Zhenyu Zhang,
Ran Ma,
Yuhan Liu,
Zijian Zhuang,
Shuai Yi,
Yixiong Zou
, et al. (37 additional authors not shown)
Abstract:
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registe…
▽ More
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Sequence models for by-trial decoding of cognitive strategies from neural data
Authors:
Rick den Otter,
Gabriel Weindel,
Sjoerd Stuit,
Leendert van Maanen
Abstract:
Understanding the sequence of cognitive operations that underlie decision-making is a fundamental challenge in cognitive neuroscience. Traditional approaches often rely on group-level statistics, which obscure trial-by-trial variations in cognitive strategies. In this study, we introduce a novel machine learning method that combines Hidden Multivariate Pattern analysis with a Structured State Spac…
▽ More
Understanding the sequence of cognitive operations that underlie decision-making is a fundamental challenge in cognitive neuroscience. Traditional approaches often rely on group-level statistics, which obscure trial-by-trial variations in cognitive strategies. In this study, we introduce a novel machine learning method that combines Hidden Multivariate Pattern analysis with a Structured State Space Sequence model to decode cognitive strategies from electroencephalography data at the trial level. We apply this method to a decision-making task, where participants were instructed to prioritize either speed or accuracy in their responses. Our results reveal an additional cognitive operation, labeled Confirmation, which seems to occur predominantly in the accuracy condition but also frequently in the speed condition. The modeled probability that this operation occurs is associated with higher probability of responding correctly as well as changes of mind, as indexed by electromyography data. By successfully modeling cognitive operations at the trial level, we provide empirical evidence for dynamic variability in decision strategies, challenging the assumption of homogeneous cognitive processes within experimental conditions. Our approach shows the potential of sequence modeling in cognitive neuroscience to capture trial-level variability that is obscured by aggregate analyses. The introduced method offers a new way to detect and understand cognitive strategies in a data-driven manner, with implications for both theoretical research and practical applications in many fields.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Low-Light Image Enhancement using Event-Based Illumination Estimation
Authors:
Lei Sun,
Yuhan Bao,
Jiajun Zhai,
Jingyun Liang,
Yulun Zhang,
Kaiwei Wang,
Danda Pani Paudel,
Luc Van Gool
Abstract:
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., ''motion events'' to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new aven…
▽ More
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., ''motion events'' to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using ''temporal-mapping'' events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light condition is investigated for realistic training data synthesizing. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect EvLowLight dataset that includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frame-per-second on a 640X480 image.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Learning Verified Monitors for Hidden Markov Models
Authors:
Luko van der Maas,
Sebastian Junges
Abstract:
Runtime monitors assess whether a system is in an unsafe state based on a stream of observations. We study the problem where the system is subject to probabilistic uncertainty and described by a hidden Markov model. A stream of observations is then unsafe if the probability of being in an unsafe state is above a threshold. A correct monitor recognizes the set of unsafe observations. The key contri…
▽ More
Runtime monitors assess whether a system is in an unsafe state based on a stream of observations. We study the problem where the system is subject to probabilistic uncertainty and described by a hidden Markov model. A stream of observations is then unsafe if the probability of being in an unsafe state is above a threshold. A correct monitor recognizes the set of unsafe observations. The key contribution of this paper is the first correct-by-construction synthesis method for such monitors, represented as finite automata. The contribution combines four ingredients: First, we establish the coNP-hardness of checking whether an automaton is a correct monitor, i.e., a monitor without misclassifications. Second, we provide a reduction that reformulates the search for misclassifications into a standard probabilistic system synthesis problem. Third, we integrate the verification routine into an active automata learning routine to synthesize correct monitors. Fourth, we provide a prototypical implementation that shows the feasibility and limitations of the approach on a series of benchmarks.
△ Less
Submitted 11 April, 2025; v1 submitted 8 April, 2025;
originally announced April 2025.
-
Towards deployment-centric multimodal AI beyond vision and language
Authors:
Xianyuan Liu,
Jiayang Zhang,
Shuo Zhou,
Thijs L. van der Plas,
Avish Vijayaraghavan,
Anastasiia Grishina,
Mengdie Zhuang,
Daniel Schofield,
Christopher Tomlinson,
Yuhan Wang,
Ruizhe Li,
Louisa van Zeeland,
Sina Tabakhi,
Cyndie Demeocq,
Xiang Li,
Arunav Das,
Orlando Timmerman,
Thomas Baldwin-McDonald,
Jinge Wu,
Peizhen Bai,
Zahraa Al Sahili,
Omnia Alwazzan,
Thao N. Do,
Mohammod N. I. Suvon,
Angeline Wang
, et al. (23 additional authors not shown)
Abstract:
Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that in…
▽ More
Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
Exploration-Driven Generative Interactive Environments
Authors:
Nedko Savov,
Naser Kazemi,
Mohammad Mahdi,
Danda Pani Paudel,
Xi Wang,
Luc Van Gool
Abstract:
Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared…
▽ More
Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that entirely relies on the uncertainty of the world model, delivering diverse data from which it can learn the best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments achieving video fidelity and controllability improvement. In order to obtain automatically large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie - GenieRedux and apply enhancements and adaptations in our version GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories
Authors:
Alexey Gavryushin,
Alexandros Delitzas,
Luc Van Gool,
Marc Pollefeys,
Kaichun Mo,
Xi Wang
Abstract:
When humans grasp an object, they naturally form trajectories in their minds to manipulate it for specific tasks. Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems in learning to operate effectively within the physical world. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interact…
▽ More
When humans grasp an object, they naturally form trajectories in their minds to manipulate it for specific tasks. Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems in learning to operate effectively within the physical world. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image and a brief language-based task description. Prior work on hand-object trajectory generation typically relies on textual input that lacks explicit grounding to the target object, or assumes access to 3D object meshes, which are often considerably more difficult to obtain than 2D images. We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database and enforcing geometric hand-object interaction constraints via a novel inference-time diffusion guidance. We benchmark our model on the HOI4D and H2O datasets, adapting relevant baselines for this novel task. Experiments demonstrate our superior performance in the diversity and quality of generated trajectories, as well as in hand-object interaction geometry metrics.
△ Less
Submitted 29 May, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Generalizable Implicit Neural Representations via Parameterized Latent Dynamics for Baroclinic Ocean Forecasting
Authors:
Guang Zhao,
Xihaier Luo,
Seungjun Lee,
Yihui Ren,
Shinjae Yoo,
Luke Van Roekel,
Balu Nadiga,
Sri Hari Krishna Narayanan,
Yixuan Sun,
Wei Xu
Abstract:
Mesoscale ocean dynamics play a critical role in climate systems, governing heat transport, hurricane genesis, and drought patterns. However, simulating these processes at high resolution remains computationally prohibitive due to their nonlinear, multiscale nature and vast spatiotemporal domains. Implicit neural representations (INRs) reduce the computational costs as resolution-independent surro…
▽ More
Mesoscale ocean dynamics play a critical role in climate systems, governing heat transport, hurricane genesis, and drought patterns. However, simulating these processes at high resolution remains computationally prohibitive due to their nonlinear, multiscale nature and vast spatiotemporal domains. Implicit neural representations (INRs) reduce the computational costs as resolution-independent surrogates but fail in many-query scenarios (inverse modeling) requiring rapid evaluations across diverse parameters. We present PINROD, a novel framework combining dynamics-aware implicit neural representations with parameterized neural ordinary differential equations to address these limitations. By integrating parametric dependencies into latent dynamics, our method efficiently captures nonlinear oceanic behavior across varying boundary conditions and physical parameters. Experiments on ocean mesoscale activity data show superior accuracy over existing baselines and improved computational efficiency compared to standard numerical simulations.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness
Authors:
Chenfei Liao,
Kaiyu Lei,
Xu Zheng,
Junha Moon,
Zhixiong Wang,
Yixuan Wang,
Danda Pani Paudel,
Luc Van Gool,
Xuming Hu
Abstract:
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absenc…
▽ More
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics-$mIoU^{Avg}_{EMM}$, $mIoU^{E}_{EMM}$, $mIoU^{Avg}_{RMM}$, and $mIoU^{E}_{RMM}$-to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at https://github.com/Chenfei-Liao/Multi-Modal-Semantic-Segmentation-Robustness-Benchmark.
△ Less
Submitted 10 April, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Authors:
Yue Li,
Qi Ma,
Runyi Yang,
Huapeng Li,
Mengjiao Ma,
Bin Ren,
Nikola Popovic,
Nicu Sebe,
Ender Konukoglu,
Theo Gevers,
Luc Van Gool,
Martin R. Oswald,
Danda Pani Paudel
Abstract:
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Mean…
▽ More
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines.
△ Less
Submitted 3 June, 2025; v1 submitted 23 March, 2025;
originally announced March 2025.
-
Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook
Authors:
Xu Zheng,
Ziqiao Weng,
Yuanhuiyi Lyu,
Lutao Jiang,
Haiwei Xue,
Bin Ren,
Danda Paudel,
Nicu Sebe,
Luc Van Gool,
Xuming Hu
Abstract:
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, t…
▽ More
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
UniK3D: Universal Camera Monocular 3D Estimation
Authors:
Luigi Piccinelli,
Christos Sakaridis,
Mattia Segu,
Yung-Hsu Yang,
Siyuan Li,
Wim Abbeloos,
Luc Van Gool
Abstract:
Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we pre…
▽ More
Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models are available at github.com/lpiccinelli-eth/unik3d .
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures
Authors:
Lucas Morin,
Valéry Weber,
Ahmed Nassar,
Gerhard Ingmar Meijer,
Luc Van Gool,
Yawei Li,
Peter Staar
Abstract:
The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures f…
▽ More
The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
Physics-based AI methodology for Material Parameter Extraction from Optical Data
Authors:
M. Koumans,
J. L. M. van Mechelen
Abstract:
We report on a novel methodology for extracting material parameters from spectroscopic optical data using a physics-based neural network. The proposed model integrates classical optimization frameworks with a multi-scale object detection framework, specifically exploring the effect of incorporating physics into the neural network. We validate and analyze its performance on simulated transmission s…
▽ More
We report on a novel methodology for extracting material parameters from spectroscopic optical data using a physics-based neural network. The proposed model integrates classical optimization frameworks with a multi-scale object detection framework, specifically exploring the effect of incorporating physics into the neural network. We validate and analyze its performance on simulated transmission spectra at terahertz and infrared frequencies. Compared to traditional model-based approaches, our method is designed to be autonomous, robust, and time-efficient, making it particularly relevant for industrial and societal applications.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Lawful and Accountable Personal Data Processing with GDPR-based Access and Usage Control in Distributed Systems
Authors:
L. Thomas van Binsbergen,
Marten C. Steketee,
Milen G. Kebede,
Heleen L. Janssen,
Tom M. van Engers
Abstract:
Compliance with the GDPR privacy regulation places a significant burden on organisations regarding the handling of personal data. The perceived efforts and risks of complying with the GDPR further increase when data processing activities span across organisational boundaries, as is the case in both small-scale data sharing settings and in large-scale international data spaces.
This paper address…
▽ More
Compliance with the GDPR privacy regulation places a significant burden on organisations regarding the handling of personal data. The perceived efforts and risks of complying with the GDPR further increase when data processing activities span across organisational boundaries, as is the case in both small-scale data sharing settings and in large-scale international data spaces.
This paper addresses these concerns by proposing a case-generic method for automated normative reasoning that establishes legal arguments for the lawfulness of data processing activities. The arguments are established on the basis of case-specific legal qualifications made by privacy experts, bringing the human in the loop. The obtained expert system promotes transparency and accountability, remains adaptable to extended or altered interpretations of the GDPR, and integrates into novel or existing distributed data processing systems.
This result is achieved by defining a formal ontology and semantics for automated normative reasoning based on an analysis of the purpose-limitation principle of the GDPR. The ontology and semantics are implemented in eFLINT, a domain-specific language for specifying and reasoning with norms. The XACML architecture standard, applicable to both access and usage control, is extended, demonstrating how GDPR-based normative reasoning can integrate into (existing, distributed) systems for data processing. The resulting system is designed and critically assessed in reference to requirements extracted from the GPDR.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.