Showing 1–50 of 107 results for author: Paudel, D

  1. arXiv:2506.08710  [pdf, ps, other]

    cs.CV

    SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

    Authors: Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, Theo Gevers, Martin R. Oswald, Danda Pani Paudel

    Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. The current Language Gaussian Splatting line of work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) general…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 15 pages, codes, data and benchmark will be released

  2. arXiv:2506.05872  [pdf, ps, other]

    cs.CV

    Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

    Authors: Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang

    Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation…

    Submitted 6 June, 2025; originally announced June 2025.

  3. arXiv:2506.05856  [pdf, ps, other]

    cs.CV cs.AI

    Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025

    Authors: Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Luc Van Gool

    Abstract: In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhanc…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 2nd Prize Award, EgoExo4D Relations, Second Joint EgoVis Workshop with CVPR 2025; technical report accepted at CVPRW 2025

  4. arXiv:2506.03675  [pdf, ps, other]

    cs.CV

    BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation

    Authors: Jialei Chen, Xu Zheng, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi

    Abstract: Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a m…

    Submitted 4 June, 2025; originally announced June 2025.

  5. arXiv:2506.01667  [pdf, other]

    cs.CV

    EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

    Authors: Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota

    Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data unders…

    Submitted 2 June, 2025; originally announced June 2025.

  6. arXiv:2505.22246  [pdf, ps, other]

    cs.CV

    StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

    Authors: Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool

    Abstract: World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on only a few recent observations leads them to lose track of the long-term context. Consequently, in just a few steps the generated scenes drift from what was previously observed, undermining the temporal coherence of the sequence. This limitation of…

    Submitted 26 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

  7. arXiv:2505.18679  [pdf, ps, other]

    cs.CV

    Manifold-aware Representation Learning for Degradation-agnostic Image Restoration

    Authors: Bin Ren, Yawei Li, Xu Zheng, Yuqian Fu, Danda Pani Paudel, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe

    Abstract: Image Restoration (IR) aims to recover high-quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low-light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a un…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: All-in-One Image Restoration, low-level vision

  8. arXiv:2505.18657  [pdf, ps, other]

    cs.AI

    MLLMs are Deeply Affected by Modality Bias

    Authors: Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu-Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, Xuming Hu

    Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of m…

    Submitted 24 May, 2025; originally announced May 2025.

  9. arXiv:2505.11907  [pdf, ps, other]

    cs.CV

    Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

    Authors: Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu

    Abstract: The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this pape…

    Submitted 17 May, 2025; originally announced May 2025.

  10. arXiv:2505.06635  [pdf, ps, other]

    cs.CV

    Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

    Authors: Xu Zheng, Yuanhuiyi Lyu, Lutao Jiang, Danda Pani Paudel, Luc Van Gool, Xuming Hu

    Abstract: Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-wo…

    Submitted 10 May, 2025; originally announced May 2025.

  11. arXiv:2505.05023  [pdf, other]

    cs.CV

    Split Matching for Inductive Zero-shot Semantic Segmentation

    Authors: Jialei Chen, Xu Zheng, Dongyue Li, Chong Yi, Seigo Ito, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi

    Abstract: Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables objec…

    Submitted 8 May, 2025; originally announced May 2025.

  12. arXiv:2504.14249  [pdf, other]

    cs.CV

    Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

    Authors: Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe

    Abstract: Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing mo…

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: Efficient All in One Image Restoration

  13. arXiv:2504.12401  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

    Authors: Lei Sun, Andrea Alfarano, Peiqi Duan, Shaolin Su, Kaiwei Wang, Boxin Shi, Radu Timofte, Danda Pani Paudel, Luc Van Gool, Qinglin Liu, Wei Yu, Xiaoqian Lv, Lu Yang, Shuigen Wang, Shengping Zhang, Xiangyang Ji, Long Bao, Yuqiang Yang, Jinao Song, Ziyi Wang, Shuang Wen, Heng Sun, Kean Liu, Mingchen Zhong, Senyan Xu, et al. (63 additional authors not shown)

    Abstract: This paper presents an overview of the NTIRE 2025 First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on com…

    Submitted 16 April, 2025; originally announced April 2025.

  14. arXiv:2504.09379  [pdf, other]

    cs.CV

    Low-Light Image Enhancement using Event-Based Illumination Estimation

    Authors: Lei Sun, Yuhan Bao, Jiajun Zhai, Jingyun Liang, Yulun Zhang, Kaiwei Wang, Danda Pani Paudel, Luc Van Gool

    Abstract: Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., "motion events", to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new aven…

    Submitted 12 April, 2025; originally announced April 2025.

  15. arXiv:2504.02515  [pdf, other]

    cs.CV

    Exploration-Driven Generative Interactive Environments

    Authors: Nedko Savov, Naser Kazemi, Mohammad Mahdi, Danda Pani Paudel, Xi Wang, Luc Van Gool

    Abstract: Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: Accepted at CVPR 2025

  16. arXiv:2503.18445  [pdf, other]

    cs.CV

    Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness

    Authors: Chenfei Liao, Kaiyu Lei, Xu Zheng, Junha Moon, Zhixiong Wang, Yixuan Wang, Danda Pani Paudel, Luc Van Gool, Xuming Hu

    Abstract: Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absenc…

    Submitted 10 April, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: This paper has been accepted by the CVPR 2025 Workshop: TMM-OpenWorld as an oral presentation paper

  17. arXiv:2503.18052  [pdf, ps, other]

    cs.CV

    SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

    Authors: Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Martin R. Oswald, Danda Pani Paudel

    Abstract: Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Mean…

    Submitted 3 June, 2025; v1 submitted 23 March, 2025; originally announced March 2025.

    Comments: Our code, model, and dataset will be released at https://unique1i.github.io/SceneSplat_webpage/

  18. arXiv:2503.18016  [pdf, other]

    cs.CV

    Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

    Authors: Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

    Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, t…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: 19 pages, 10 figures

  19. arXiv:2502.10012  [pdf, other]

    cs.AI cs.RO

    Dream to Drive: Model-Based Vehicle Control Using Analytic World Models

    Authors: Asen Nachkov, Danda Pani Paudel, Jan-Nico Zaech, Davide Scaramuzza, Luc Van Gool

    Abstract: Differentiable simulators have recently shown great promise for training autonomous vehicle controllers. Being able to backpropagate through them, they can be placed into an end-to-end training loop where their known dynamics turn into useful priors for the policy to learn, removing the typical black box assumption of the environment. So far, these systems have only been used to train policies. Ho…

    Submitted 14 February, 2025; originally announced February 2025.

  20. arXiv:2501.08982  [pdf, other]

    cs.CV

    CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation

    Authors: Qi Ma, Runyi Yang, Bin Ren, Nicu Sebe, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel

    Abstract: Localizing textual descriptions within large-scale 3D scenes presents inherent ambiguities, such as identifying all traffic lights in a city. Addressing this, we introduce a method to generate distributions of camera poses conditioned on textual descriptions, facilitating robust reasoning for broadly defined concepts. Our approach employs a diffusion-based architecture to refine noisy 6DoF camer…

    Submitted 3 February, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

  21. arXiv:2412.01807  [pdf, other]

    cs.CV

    Occam's LGS: An Efficient Approach for Language Gaussian Splatting

    Authors: Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, Danda Pani Paudel

    Abstract: TL;DR: Gaussian Splatting is a widely adopted approach for 3D scene representation, offering efficient, high-quality reconstruction and rendering. A key reason for its success is the simplicity of representing scenes with sets of Gaussians, making it interpretable and adaptable. To enhance understanding beyond visual representation, recent approaches extend Gaussian Splatting with semantic vision-…

    Submitted 8 March, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

    Comments: Project Page: https://insait-institute.github.io/OccamLGS/

  22. arXiv:2412.01398  [pdf, other]

    cs.CV cs.RO

    Holistic Understanding of 3D Scenes as Universal Scene Description

    Authors: Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, Danda Pani Paudel

    Abstract: 3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task…

    Submitted 2 December, 2024; originally announced December 2024.

  23. arXiv:2412.01370  [pdf, other]

    cs.CV cs.CL

    Understanding the World's Museums through Vision-Language Reasoning

    Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool

    Abstract: Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a)…

    Submitted 2 December, 2024; originally announced December 2024.

  24. arXiv:2411.19083  [pdf, other]

    cs.CV cs.AI

    ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos

    Authors: Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, Luc Van Gool

    Abstract: In this paper, we focus on the Ego-Exo Object Correspondence task, an emerging challenge in the field of computer vision that aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task, featuring two new modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse effectively f…

    Submitted 28 November, 2024; originally announced November 2024.

  25. arXiv:2411.18466  [pdf, other]

    cs.CV

    Complexity Experts are Task-Discriminative Learners for Any Image Restoration

    Authors: Eduard Zamfir, Zongwei Wu, Nancy Mehta, Yuedong Tan, Danda Pani Paudel, Yulun Zhang, Radu Timofte

    Abstract: Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing ac…

    Submitted 13 March, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

    Comments: Accepted at CVPR 2025

  26. arXiv:2411.16804  [pdf, other]

    cs.CV

    InTraGen: Trajectory-controlled Video Generation for Object Interactions

    Authors: Zuhao Liu, Aleksandar Yanev, Ahmad Mahmood, Ivan Nikolov, Saman Motamed, Wei-Shi Zheng, Xi Wang, Luc Van Gool, Danda Pani Paudel

    Abstract: Advances in video generation have significantly improved the realism and quality of created scenes. This has fueled interest in developing intuitive tools that let users leverage video generation as world simulators. Text-to-video (T2V) generation is one such approach, enabling video creation from text descriptions only. Yet, due to the inherent ambiguity in texts and the limited temporal informat…

    Submitted 25 November, 2024; originally announced November 2024.

  27. arXiv:2411.15018  [pdf, other]

    cs.CV

    Neural 4D Evolution under Large Topological Changes from 2D Images

    Authors: AmirHossein Naghi Razlighi, Tiago Novello, Asen Nachkov, Thomas Probst, Danda Paudel

    Abstract: In the literature, it has been shown that the evolution of a known explicit 3D surface to a target one can be learned from 2D images using the instantaneous flow field, where the known and target 3D surfaces may largely differ in topology. We are interested in capturing 4D shapes whose topology changes largely over time. We find that the straightforward extension of the existing 3D-based…

    Submitted 22 November, 2024; originally announced November 2024.

    Comments: 15 pages, 21 figures

    ACM Class: I.4.5; I.3.5

  28. arXiv:2411.13040  [pdf, other]

    cs.CV

    RobustFormer: Noise-Robust Pre-training for images and videos

    Authors: Ashish Bastola, Nishant Luitel, Hao Wang, Danda Pani Paudel, Roshani Poudel, Abolfazl Razi

    Abstract: While deep learning models are powerful tools that revolutionized many areas, they are also vulnerable to noise as they rely heavily on learning patterns and features from the exact details of the clean data. Transformers, which have become the backbone of modern vision models, are no exception. Current Discrete Wavelet Transform (DWT)-based methods do not benefit from masked autoencoder (MAE) pr…

    Submitted 20 November, 2024; originally announced November 2024.

    Comments: 13 pages

  29. arXiv:2410.03812  [pdf, other]

    cs.CV

    EvenNICER-SLAM: Event-based Neural Implicit Encoding SLAM

    Authors: Shi Chen, Danda Pani Paudel, Luc Van Gool

    Abstract: The advancement of dense visual simultaneous localization and mapping (SLAM) has been greatly facilitated by the emergence of neural implicit representations. Neural implicit encoding SLAM, a typical example of which is NICE-SLAM, has recently demonstrated promising results in large-scale indoor scenes. However, these methods typically rely on temporally dense RGB-D image streams as input in order…

    Submitted 4 October, 2024; originally announced October 2024.

  30. arXiv:2409.15250  [pdf, other]

    cs.CV cs.RO

    ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

    Authors: Sombit Dey, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel

    Abstract: Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is open Vision Language Action models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual gener…

    Submitted 20 May, 2025; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: Accepted at ICRA-2025, Atlanta

  31. arXiv:2409.07965  [pdf, other]

    cs.AI cs.RO

    Autonomous Vehicle Controllers From End-to-End Differentiable Simulation

    Authors: Asen Nachkov, Danda Pani Paudel, Luc Van Gool

    Abstract: Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, these RL alg…

    Submitted 12 September, 2024; originally announced September 2024.

  32. arXiv:2409.06445  [pdf, other]

    cs.CV cs.AI

    Learning Generative Interactive Environments By Trained Agent Exploration

    Authors: Naser Kazemi, Nedko Savov, Danda Paudel, Luc Van Gool

    Abstract: World models are increasingly pivotal in interpreting and simulating the rules and actions of complex environments. Genie, a recent model, excels at learning from visually diverse environments but relies on costly human-collected data. We observe that their alternative method of using random agents is too limited to explore the environment. We propose to improve the model by employing reinforcemen…

    Submitted 18 October, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

  33. arXiv:2409.01690  [pdf, other]

    cs.CV cs.CL

    Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

    Authors: Ada-Astrid Balauca, Danda Pani Paudel, Kristina Toutanova, Luc Van Gool

    Abstract: CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured -- in the form of tabular data -- visual understanding of museum exhibits. T…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted to ECCV 2024

  34. arXiv:2408.16504  [pdf, other]

    cs.CV

    A Simple and Generalist Approach for Panoptic Segmentation

    Authors: Nedyalko Prisadnikov, Wouter Van Gansbeke, Danda Pani Paudel, Luc Van Gool

    Abstract: Panoptic segmentation is an important computer vision task, where the current state-of-the-art solutions require specialized components to perform well. We propose a simple generalist framework based on a deep encoder - shallow decoder architecture with per-pixel prediction, essentially fine-tuning a massively pretrained image model with minimal additional components. Naively, this method does not…

    Submitted 7 March, 2025; v1 submitted 29 August, 2024; originally announced August 2024.

  35. arXiv:2408.10906  [pdf, other]

    cs.CV

    ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

    Authors: Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Danda Pani Paudel

    Abstract: 3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for the 3D understanding directly in this representation space. To facilitate the research in this direction, we first build a large-scale dataset of 3DGS using the commonly used ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories, w…

    Submitted 20 August, 2024; originally announced August 2024.

  36. arXiv:2408.09110  [pdf, other]

    cs.CV

    Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community

    Authors: Jiancheng Pan, Yanxing Liu, Yuqian Fu, Muyuan Ma, Jiahao Li, Danda Pani Paudel, Luc Van Gool, Xiaomeng Huang

    Abstract: Object detection, particularly open-vocabulary object detection, plays a crucial role in Earth sciences, such as environmental monitoring, natural disaster assessment, and land-use planning. However, existing open-vocabulary detectors, primarily trained on natural-world images, struggle to generalize to remote sensing images due to a significant data domain gap. Thus, this paper aims to advance th…

    Submitted 6 March, 2025; v1 submitted 17 August, 2024; originally announced August 2024.

    Comments: 15 pages, 11 figures

  37. arXiv:2407.13372  [pdf, other]

    cs.CV

    Restore Anything Model via Efficient Degradation Adaptation

    Authors: Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Nicu Sebe

    Abstract: With the proliferation of mobile devices, the need for an efficient model to restore any degraded image has become increasingly significant and impactful. Traditional approaches typically involve training dedicated models for each specific degradation, resulting in inefficiency and redundancy. More recent solutions either introduce additional modules to learn visual prompts significantly increasin…

    Submitted 18 December, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Efficient Any Image Restoration

  38. arXiv:2407.11174  [pdf, other]

    cs.CV cs.AI

    iHuman: Instant Animatable Digital Humans From Monocular Videos

    Authors: Pramish Paudel, Anubhav Khanal, Ajad Chhatkuli, Danda Pani Paudel, Jyoti Tandukar

    Abstract: Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geome…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 15 pages, ECCV 2024

  39. arXiv:2407.05862  [pdf, other]

    cs.CV

    Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

    Authors: Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

    Abstract: Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises the question: Can we take the best of both worlds? To answer this question, we first empirically validate that integrating MAE-ba…

    Submitted 8 July, 2024; originally announced July 2024.

  40. arXiv:2406.17438  [pdf, other]

    cs.CV

    Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes

    Authors: Qi Ma, Danda Pani Paudel, Ender Konukoglu, Luc Van Gool

    Abstract: Neural implicit functions have demonstrated significant importance in various areas such as computer vision and graphics. Their advantages include the ability to represent complex shapes and scenes with high fidelity, smooth interpolation capabilities, and continuous representations. Despite these benefits, the development and analysis of implicit functions have been limited by the lack of comprehens…

    Submitted 25 June, 2024; originally announced June 2024.

  41. arXiv:2405.17773  [pdf, other]

    cs.CV

    XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

    Authors: Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Eduard Zamfi, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte

    Abstract: Multimodal sensing has proven valuable for visual tracking, as different sensor types offer unique strengths in handling one specific challenging scene where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, development is hindered by data sparsity: in practice, typically only one modality is available at a time. Therefore, it is crucial to ens…

    Submitted 28 November, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: 11 pages, 5 figs

  42. arXiv:2405.15475  [pdf, other]

    cs.CV

    Efficient Degradation-aware Any Image Restoration

    Authors: Eduard Zamfir, Zongwei Wu, Nancy Mehta, Danda Pani Paudel, Yulun Zhang, Radu Timofte

    Abstract: Reconstructing missing details from degraded low-quality inputs poses a significant challenge. Recent progress in image restoration has demonstrated the efficacy of learning large models capable of addressing various degradations simultaneously. Nonetheless, these approaches introduce considerable computational overhead and complex learning paradigms, limiting their practical utility. In response,…

    Submitted 1 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

  43. arXiv:2404.16113  [pdf, other]

    eess.SY

    Joint operation of a fast-charging EV hub with a stand-alone independent battery storage system under fairness considerations

    Authors: Diwas Paudel, Luke Wolf, Tapas K. Das

    Abstract: The need for larger-scale fast-charging electric vehicle (EV) hubs is on the rise due to the growth in EV adoption. Another area of power infrastructure growth is the proliferation of independently operated stand-alone battery storage systems (BSS), which is fueled by improvements and cost reductions in battery technology. Many possible uses of the stand-alone BSS are being explored including part…

    Submitted 24 April, 2024; originally announced April 2024.

  44. arXiv:2404.01243  [pdf, other]

    cs.CV

    A Unified and Interpretable Emotion Representation and Expression Generation

    Authors: Reni Paskaleva, Mykyta Holubakha, Andela Ilic, Saman Motamed, Luc Van Gool, Danda Paudel

    Abstract: Canonical emotions, such as happy, sad, and fearful, are easy to understand and annotate. However, emotions are often compound, e.g. happily surprised, and can be mapped to the action units (AUs) used for expressing emotions, and trivially to the canonical ones. Intuitively, emotions are continuous as represented by the arousal-valence (AV) model. An interpretable unification of these four modalit…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: 10 pages, 9 figures, 3 tables. Accepted at CVPR 2024. Project page: https://emotion-diffusion.github.io

  45. arXiv:2401.15108  [pdf, other]

    cs.LG cs.AI econ.GN eess.SY

    Tacit algorithmic collusion in deep reinforcement learning guided price competition: A study using EV charge pricing game

    Authors: Diwas Paudel, Tapas K. Das

    Abstract: Players in pricing games with complex structures are increasingly adopting artificial intelligence (AI) aided learning algorithms to make pricing decisions for maximizing profits. This is raising concern for the antitrust agencies as the practice of using AI may promote tacit algorithmic collusion among otherwise independent players. Recent studies of games in canonical forms have shown contrastin…

    Submitted 10 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

  46. arXiv:2312.15242  [pdf, other]

    cs.CV

    CaLDiff: Camera Localization in NeRF via Pose Diffusion

    Authors: Rashik Shrestha, Bishad Koju, Abhigyan Bhusal, Danda Pani Paudel, François Rameau

    Abstract: With the widespread use of NeRF-based implicit 3D representation, the need for camera localization in the same representation becomes manifestly apparent. Doing so not only simplifies the localization process -- by avoiding an outside-the-NeRF-based localization -- but also has the potential to offer the benefit of enhanced localization. This paper studies the problem of localizing cameras in NeRF…

    Submitted 23 December, 2023; originally announced December 2023.

  47. arXiv:2312.13332  [pdf, other]

    cs.CV

    Ternary-Type Opacity and Hybrid Odometry for RGB NeRF-SLAM

    Authors: Junru Lin, Asen Nachkov, Songyou Peng, Luc Van Gool, Danda Pani Paudel

    Abstract: In this work, we address the challenge of deploying Neural Radiance Fields (NeRFs) in Simultaneous Localization and Mapping (SLAM) under the condition of lacking depth information, relying solely on RGB inputs. The key to unlocking the full potential of NeRF in such a challenging context lies in the integration of real-world priors. A crucial prior we explore is the binary opacity prior of 3D space…

    Submitted 23 September, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: IROS 2024

  48. arXiv:2312.11578  [pdf, other]

    cs.CV

    Diffusion-Based Particle-DETR for BEV Perception

    Authors: Asen Nachkov, Martin Danelljan, Danda Pani Paudel, Luc Van Gool

    Abstract: The Bird-Eye-View (BEV) is one of the most widely used scene representations for visual perception in Autonomous Vehicles (AVs) due to its compatibility with downstream tasks. For the enhanced safety of AVs, modeling perception uncertainty in BEV is crucial. Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively det…

    Submitted 18 December, 2023; originally announced December 2023.

  49. arXiv:2312.08558  [pdf, other]

    cs.CV

    Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction

    Authors: M. Eren Akbiyik, Nedko Savov, Danda Pani Paudel, Nikola Popovic, Christian Vater, Otmar Hilliges, Luc Van Gool, Xi Wang

    Abstract: Understanding drivers' decision-making is crucial for road safety. Although predicting the ego-vehicle's path is valuable for driver-assistance systems, existing methods mainly focus on external factors like other vehicles' motions, often neglecting the driver's attention and intent. To address this gap, we infer the ego-trajectory by integrating the driver's gaze and the surrounding scene. We int…

    Submitted 15 April, 2025; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted to 13th International Conference on Learning Representations (ICLR 2025), 29 pages

  50. arXiv:2311.17119  [pdf, other]

    cs.CV

    Continuous Pose for Monocular Cameras in Neural Implicit Representation

    Authors: Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool

    Abstract: In this paper, we showcase the effectiveness of optimizing monocular camera poses as a continuous function of time. The camera poses are represented using an implicit neural function which maps the given time to the corresponding camera pose. The mapped camera poses are then used for the downstream tasks where joint camera pose optimization is also required. While doing so, the network parameters…

    Submitted 2 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.