Showing 1–50 of 141 results for author: Porikli, F

Searching in archive cs.
  1. arXiv:2506.20879

    cs.CV

    MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

    Authors: Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli

    Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 sample…

    Submitted 25 June, 2025; originally announced June 2025.

  2. arXiv:2506.10242

    cs.CV

    DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos

    Authors: Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli

    Abstract: Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propo…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 Workshop on Autonomous Driving

  3. arXiv:2506.10145

    cs.CV

    RoCA: Robust Cross-Domain End-to-End Autonomous Driving

    Authors: Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

    Abstract: End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohib…

    Submitted 17 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  4. arXiv:2506.09417

    cs.CV

    ODG: Occupancy Prediction Using Dual Gaussians

    Authors: Yunxiao Shi, Yinhao Zhu, Shizhong Han, Jisoo Jeong, Amin Ansari, Hong Cai, Fatih Porikli

    Abstract: Occupancy prediction infers fine-grained 3D geometry and semantics from camera images of the surrounding environment, making it a critical perception task for autonomous driving. Existing methods either adopt dense grids as scene representation, which are difficult to scale to high resolution, or learn the entire scene using a single set of sparse queries, which is insufficient to handle the variou…

    Submitted 12 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  5. arXiv:2506.07002

    cs.CV

    BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction

    Authors: Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, Fatih Porikli

    Abstract: 3D occupancy provides fine-grained 3D geometry and semantics for scene understanding, which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More recent works have adopted Bird's Eye View (BEV) or sparse points as scene representation with much reduced cost, but s…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: Two-page abstract version available at CVPR 2025 Embodied AI Workshop

  6. arXiv:2506.04499

    cs.CV

    FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

    Authors: Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

    Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast…

    Submitted 4 June, 2025; originally announced June 2025.

  7. arXiv:2506.04244

    cs.AI

    Zero-Shot Adaptation of Parameter-Efficient Fine-Tuning in Diffusion Models

    Authors: Farzad Farhadzadeh, Debasmit Das, Shubhankar Borse, Fatih Porikli

    Abstract: We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments (e.g., LoRA) from a source to a target model without additional training data. This overcomes the limitations of traditional methods that require retraining when switching base models, often challenging due to data constraints.…

    Submitted 29 May, 2025; originally announced June 2025.

    Comments: ICML 2025

  8. arXiv:2506.03290

    cs.CV

    Learning Optical Flow Field via Neural Ordinary Differential Equation

    Authors: Leyla Mirvakhabova, Hong Cai, Jisoo Jeong, Hanno Ackermann, Farhad Zanjani, Fatih Porikli

    Abstract: Recent works on optical flow estimation use neural networks to predict the flow field that maps positions of one image to positions of the other. These networks consist of a feature extractor, a correlation volume, and finally several refinement steps. These refinement steps mimic the iterative refinements performed by classical optimization algorithms and are usually implemented by neural layers…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: CVPRW 2025

  9. arXiv:2506.00324

    cs.CV

    Improving Optical Flow and Stereo Depth Estimation by Leveraging Uncertainty-Based Learning Difficulties

    Authors: Jisoo Jeong, Hong Cai, Jamie Menjay Lin, Fatih Porikli

    Abstract: Conventional training for optical flow and stereo depth models typically employs a uniform loss function across all pixels. However, this one-size-fits-all approach often overlooks the significant variations in learning difficulty among individual pixels and contextual regions. This paper investigates uncertainty-based confidence maps, which capture these spatially varying learning difficulties…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: CVPRW2025

  10. arXiv:2504.13206

    cs.GR

    DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization

    Authors: Aniket Roy, Shubhankar Borse, Shreya Kadambi, Debasmit Das, Shweta Mahajan, Risheek Garrepalli, Hyojin Park, Ankita Nayak, Rama Chellappa, Munawar Hayat, Fatih Porikli

    Abstract: We tackle the challenge of jointly personalizing content and style from a few examples. A promising approach is to train separate Low-Rank Adapters (LoRA) and merge them effectively, preserving both content and style. Existing methods, such as ZipLoRA, treat content and style as independent entities, merging them by learning masks in LoRA's output dimensions. However, content and style are intertw…

    Submitted 15 April, 2025; originally announced April 2025.

  11. arXiv:2503.22172

    cs.CV

    Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation

    Authors: Minho Park, Sunghyun Park, Jungsoo Lee, Hyojin Park, Kyuwoong Hwang, Fatih Porikli, Jaegul Choo, Sungha Choi

    Abstract: This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help…

    Submitted 28 March, 2025; originally announced March 2025.

  12. arXiv:2503.18244

    cs.CV

    CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation

    Authors: Jungsoo Lee, Debasmit Das, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, Fatih Porikli

    Abstract: We propose a novel knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). Despite recent advancements in LVFMs, such as DINOv2 and CLIP, their potential in knowledge distillation for enhancing edge models remains underexplored. While knowledge distillation is a promising approach fo…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  13. arXiv:2503.04059

    cs.CV

    H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision

    Authors: Yunxiao Shi, Hong Cai, Amin Ansari, Fatih Porikli

    Abstract: 3D occupancy prediction has recently emerged as a new paradigm for holistic 3D scene understanding and provides valuable information for downstream planning in autonomous driving. Most existing methods, however, are computationally expensive, requiring costly attention-based 2D-3D transformation and 3D feature processing. In this paper, we present a novel 3D occupancy prediction approach, H3O, whi…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: ICRA 2025

  14. arXiv:2502.19673

    cs.CV

    SubZero: Composing Subject, Style, and Action via Zero-Shot Personalization

    Authors: Shubhankar Borse, Kartikeya Bhardwaj, Mohammad Reza Karimi Dastjerdi, Hyojin Park, Shreya Kadambi, Shobitha Shivakumar, Prathamesh Mandke, Ankita Nayak, Harris Teague, Munawar Hayat, Fatih Porikli

    Abstract: Diffusion models are increasingly popular for generative tasks, including personalized composition of subjects and styles. While diffusion models can generate user-specified subjects performing text-guided actions in custom styles, they require fine-tuning and are not feasible for personalization on mobile devices. Hence, tuning-free personalization methods such as IP-Adapters have progressively g…

    Submitted 26 February, 2025; originally announced February 2025.

  15. arXiv:2501.16559

    cs.CV

    LoRA-X: Bridging Foundation Models with Training-Free Cross-Model Adaptation

    Authors: Farzad Farhadzadeh, Debasmit Das, Shubhankar Borse, Fatih Porikli

    Abstract: The rising popularity of large foundation models has led to a heightened demand for parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), which offer performance comparable to full model fine-tuning while requiring only a few additional parameters tailored to the specific base model. When such base models are deprecated and replaced, all associated LoRA modules must be retra…

    Submitted 4 February, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

    Comments: Accepted to ICLR 2025

  16. arXiv:2501.09757

    cs.CV cs.RO

    Distilling Multi-modal Large Language Models for Autonomous Driving

    Authors: Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M. Patel, Fatih Porikli

    Abstract: Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the eff…

    Submitted 16 January, 2025; originally announced January 2025.

  17. arXiv:2412.17040

    cs.LG

    HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories

    Authors: Eric Hedlin, Munawar Hayat, Fatih Porikli, Kwang Moo Yi, Shweta Mahajan

    Abstract: To efficiently adapt large models or to train generative models of neural representations, hypernetworks have drawn interest. While hypernetworks work well, training them is cumbersome, and often requires ground truth optimized weights for each sample. However, obtaining each of these weights is a training problem of its own: one needs to train, e.g., adaptation weights or even an entire neural fie…

    Submitted 19 May, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

  18. arXiv:2412.06578

    cs.CV

    MoViE: Mobile Diffusion for Video Editing

    Authors: Adil Karjauv, Noor Fathima, Ioannis Lelekas, Fatih Porikli, Amir Ghodrati, Amirhossein Habibian

    Abstract: Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon the existing image editing model, we first optimize its architecture and incorpora…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: 8 pages

  19. arXiv:2412.01931

    cs.CV

    Planar Gaussian Splatting

    Authors: Farhad G. Zanjani, Hong Cai, Hanno Ackermann, Leila Mirvakhabova, Fatih Porikli

    Abstract: This paper presents Planar Gaussian Splatting (PGS), a novel neural rendering approach to learn the 3D geometry and parse the 3D planes of a scene, directly from multiple RGB images. PGS leverages Gaussian primitives to model the scene and employs a hierarchical Gaussian mixture approach to group them. Similar Gaussians are progressively merged probabilistically in the tree-structured Gaussian…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

  20. arXiv:2411.01179

    cs.CV cs.AI cs.GR cs.LG

    Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models

    Authors: Wonguk Cho, Seokeon Choi, Debasmit Das, Matthias Reisser, Taesup Kim, Sungrack Yun, Fatih Porikli

    Abstract: Recent advancements in text-to-image diffusion models have enabled the personalization of these models to generate custom images from textual prompts. This paper presents an efficient LoRA-based personalization approach for on-device subject-driven generation, where pre-trained diffusion models are fine-tuned with user-specific data on resource-constrained devices. Our method, termed Hollowed Net,…

    Submitted 2 November, 2024; originally announced November 2024.

    Comments: NeurIPS 2024

  21. arXiv:2410.18931

    cs.CV

    Sort-free Gaussian Splatting via Weighted Sum Rendering

    Authors: Qiqi Hou, Randall Rauwendaal, Zifeng Li, Hoang Le, Farzad Farhadzadeh, Fatih Porikli, Alexei Bourd, Amir Said

    Abstract: Recently, 3D Gaussian Splatting (3DGS) has emerged as a significant advancement in 3D scene reconstruction, attracting considerable attention due to its ability to recover high-fidelity details while maintaining low complexity. Despite the promising results achieved by 3DGS, its rendering performance is constrained by its dependence on costly non-commutative alpha-blending operations. These operat…

    Submitted 8 April, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

    Comments: ICLR 2025

  22. arXiv:2410.11971

    cs.LG cs.AI cs.CV

    DDIL: Diversity Enhancing Diffusion Distillation With Imitation Learning

    Authors: Risheek Garrepalli, Shweta Mahajan, Munawar Hayat, Fatih Porikli

    Abstract: Diffusion models excel at generative modeling (e.g., text-to-image), but sampling requires multiple denoising network passes, limiting practicality. Efforts such as progressive distillation or consistency distillation have shown promise by reducing the number of passes at the expense of quality of the generated samples. In this work, we identify covariate shift as one reason for poor performance…

    Submitted 28 March, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

  23. arXiv:2407.11306

    cs.CV

    PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer

    Authors: Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli

    Abstract: We present Polynomial Attention Drop-in Replacement (PADRe), a novel and unifying framework designed to replace the conventional self-attention mechanism in transformer models. Notably, several recent alternative attention mechanisms, including Hyena, Mamba, SimA, Conv2Former, and Castling-ViT, can be viewed as specific instances of our PADRe framework. PADRe leverages polynomial functions and dra…

    Submitted 15 July, 2024; originally announced July 2024.

  24. arXiv:2407.04800

    cs.CV

    Segmentation-Free Guidance for Text-to-Image Diffusion Models

    Authors: Kambiz Azarian, Debasmit Das, Qiqi Hou, Fatih Porikli

    Abstract: We introduce segmentation-free guidance, a novel method designed for text-to-image diffusion models like Stable Diffusion. Our method does not require retraining of the diffusion model. At no additional compute cost, it uses the diffusion model itself as an implied segmentation network, hence named segmentation-free guidance, to dynamically adjust the negative prompt for each patch of the generate…

    Submitted 3 June, 2024; originally announced July 2024.

  25. arXiv:2407.00021

    cs.CV cs.GR eess.IV

    Neural Graphics Texture Compression Supporting Random Access

    Authors: Farzad Farhadzadeh, Qiqi Hou, Hoang Le, Amir Said, Randall Rauwendaal, Alex Bourd, Fatih Porikli

    Abstract: Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel texture components, but this growth in data volume has not been matched by advances in its compression. Meanwhile, Neural Image Compression (NIC) has advanced significantly and shown promising results, but the proposed methods cannot be directly adapted to neural texture compression.…

    Submitted 25 October, 2024; v1 submitted 6 May, 2024; originally announced July 2024.

    Comments: ECCV 2024

  26. arXiv:2406.08816

    cs.CV

    ToSA: Token Selective Attention for Efficient Vision Transformers

    Authors: Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, Fatih Porikli

    Abstract: In this paper, we propose a novel token selective attention approach, ToSA, which can identify tokens that need to be attended as well as those that can skip a transformer layer. More specifically, a token selector parses the current attention maps and predicts the attention maps for the next layer, which are then used to select the important tokens that should participate in the attention operati…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at CVPRW 2024

  27. arXiv:2406.08798

    cs.CV

    FouRA: Fourier Low Rank Adaptation

    Authors: Shubhankar Borse, Shreya Kadambi, Nilesh Prasad Pandey, Kartikeya Bhardwaj, Viswanath Ganapathy, Sweta Priyadarshi, Risheek Garrepalli, Rafael Esteves, Munawar Hayat, Fatih Porikli

    Abstract: While Low-Rank Adaptation (LoRA) has proven beneficial for efficiently fine-tuning large models, LoRA fine-tuned text-to-image diffusion models lack diversity in the generated images, as the model tends to copy data from the observed training samples. This effect becomes more pronounced at higher values of adapter strength and for adapters with higher ranks which are fine-tuned on smaller datasets…

    Submitted 13 June, 2024; originally announced June 2024.

  28. arXiv:2404.09918

    cs.CV

    EdgeRelight360: Text-Conditioned 360-Degree HDR Image Generation for Real-Time On-Device Video Portrait Relighting

    Authors: Min-Hui Lin, Mahesh Reddy, Guillaume Berger, Michel Sarkis, Fatih Porikli, Ning Bi

    Abstract: In this paper, we present EdgeRelight360, an approach for real-time video portrait relighting on mobile devices, utilizing text-conditioned generation of 360-degree high dynamic range image (HDRI) maps. Our method proposes a diffusion-based text-to-360-degree image generation in the HDR domain, taking advantage of the HDR10 standard. This technique facilitates the generation of high-quality, reali…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Camera-ready version (CVPR workshop - EDGE'24)

  29. arXiv:2404.08135

    cs.CV

    SciFlow: Empowering Lightweight Optical Flow Models with Self-Cleaning Iterations

    Authors: Jamie Menjay Lin, Jisoo Jeong, Hong Cai, Risheek Garrepalli, Kai Wang, Fatih Porikli

    Abstract: Optical flow estimation is crucial to a variety of vision tasks. Despite substantial recent advancements, achieving real-time on-device optical flow estimation remains a complex challenge. First, an optical flow model must be sufficiently lightweight to meet computation and memory constraints to ensure real-time performance on devices. Second, the necessity for real-time on-device operation impose…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: CVPRW 2024

  30. arXiv:2403.18092

    cs.CV

    OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation

    Authors: Jisoo Jeong, Hong Cai, Risheek Garrepalli, Jamie Menjay Lin, Munawar Hayat, Fatih Porikli

    Abstract: The scarcity of ground-truth labels poses one major challenge in developing optical flow estimation models that are both generalizable and robust. While current methods rely on data augmentation, they have yet to fully exploit the rich information available in labeled video sequences. We propose OCAI, a method that supports robust frame interpolation by generating intermediate video frames alongsi…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  31. arXiv:2403.12953

    cs.CV

    FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

    Authors: Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, Fatih Porikli

    Abstract: In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future at training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to predict multi-frame fea…

    Submitted 16 January, 2025; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: ECCV 2024

  32. arXiv:2403.12202

    cs.CV

    DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions

    Authors: Yunxiao Shi, Manish Kumar Singh, Hong Cai, Fatih Porikli

    Abstract: In this paper, we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically, we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024

  33. arXiv:2403.09620

    cs.CV

    PosSAM: Panoptic Open-vocabulary Segment Anything

    Authors: Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli

    Abstract: In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing appr…

    Submitted 14 March, 2024; originally announced March 2024.

  34. arXiv:2402.16739

    cs.CV

    Neural Mesh Fusion: Unsupervised 3D Planar Surface Understanding

    Authors: Farhad G. Zanjani, Hong Cai, Yinhao Zhu, Leyla Mirvakhabova, Fatih Porikli

    Abstract: This paper presents Neural Mesh Fusion (NMF), an efficient approach for joint optimization of a polygon mesh from multi-view image observations and unsupervised 3D planar-surface parsing of the scene. In contrast to implicit neural representations, NMF directly learns to deform a surface triangle mesh and generate an embedding for unsupervised 3D planar segmentation through gradient-based optimization…

    Submitted 26 February, 2024; originally announced February 2024.

  35. arXiv:2402.09948

    eess.SP cs.LG

    Neural 5G Indoor Localization with IMU Supervision

    Authors: Aleksandr Ermolov, Shreya Kadambi, Maximilian Arnold, Mohammed Hirzallah, Roohollah Amiri, Deepak Singh Mahendar Singh, Srinivas Yerramalli, Daniel Dijkman, Fatih Porikli, Taesang Yoo, Bence Major

    Abstract: Radio signals are well suited for user localization because they are ubiquitous, can operate in the dark, and maintain privacy. Many prior works learn mappings between channel state information (CSI) and position in a fully supervised manner. However, that approach relies on position labels, which are very expensive to acquire. In this work, this requirement is relaxed by using pseudo-labels during deployment,…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: IEEE GLOBECOM 2023

  36. arXiv:2401.07727

    cs.CV

    HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

    Authors: Antoine Mercier, Ramin Nakhli, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger

    Abstract: Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach which harnesses the power…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: 9 pages, 8 figures, 2 tables

  37. arXiv:2401.05735

    cs.CV cs.LG

    Object-Centric Diffusion for Efficient Video Editing

    Authors: Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian

    Abstract: Diffusion-based video editing methods have reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, in the form of diffusion inversion and/or cross-frame attention. In this paper, we c…

    Submitted 30 August, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: ECCV24

  38. arXiv:2312.08128

    cs.CV

    Clockwork Diffusion: Efficient Generation With Model-Step Distillation

    Authors: Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen

    Abstract: This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations.…

    Submitted 20 February, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  39. arXiv:2308.01483

    cs.CV cs.GR cs.LG

    Efficient neural supersampling on a novel gaming dataset

    Authors: Antoine Mercier, Ruan Erasmus, Yashesh Savani, Manik Dhingra, Fatih Porikli, Guillaume Berger

    Abstract: Real-time rendering for video games has become increasingly challenging due to the need for higher resolutions, framerates, and photorealism. Supersampling has emerged as an effective solution to address this challenge. Our work introduces a novel neural algorithm for supersampling rendered content that is 4 times more efficient than existing methods while maintaining the same level of accuracy. Ad…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: ICCV'23

  40. arXiv:2307.14336

    cs.CV

    MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

    Authors: Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, Fatih Porikli

    Abstract: We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment and improve any single-image depth estimation network into a video depth estimation model, enabling it to take advantage of temporal information to predict more accurate depth. In MAMo, we augment the model with memory, which aids the depth prediction as the model streams through the vi…

    Submitted 16 January, 2025; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: Accepted at ICCV 2023

  41. arXiv:2306.05691

    cs.CV

    DIFT: Dynamic Iterative Field Transforms for Memory Efficient Optical Flow

    Authors: Risheek Garrepalli, Jisoo Jeong, Rajeswaran C Ravindran, Jamie Menjay Lin, Fatih Porikli

    Abstract: Recent advancements in neural network-based optical flow estimation often come with prohibitively high computational and memory requirements, presenting challenges in their model adaptation for mobile and low-power use cases. In this paper, we introduce a lightweight, low-latency, and memory-efficient model, Dynamic Iterative Field Transforms (DIFT), for optical flow estimation feasible for edge app…

    Submitted 9 June, 2023; originally announced June 2023.

    Comments: CVPR MAI 2023 Accepted Paper

  42. arXiv:2306.03810

    cs.CV cs.RO

    X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation

    Authors: Shubhankar Borse, Senthil Yogamani, Marvin Klingner, Varun Ravi, Hong Cai, Abdulaziz Almuzairee, Fatih Porikli

    Abstract: The bird's-eye-view (BEV) grid is a typical representation of the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but suboptimally fuse their features us…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: Accepted for publication at Springer Machine Vision and Applications Journal. The Version of Record of this article is published in Machine Vision and Applications Journal, and is available online at https://doi.org/10.1007/s00138-023-01400-7. arXiv admin note: substantial text overlap with arXiv:2210.06778

  43. arXiv:2305.10764  [pdf, other]

    cs.CV

    OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

    Authors: Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su

    Abstract: We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and…

    Submitted 16 June, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Project Website: https://colin97.github.io/OpenShape/

  44. arXiv:2304.11431  [pdf, other]

    cs.CV

    A Review of Deep Learning for Video Captioning

    Authors: Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria, Fatih Porikli

    Abstract: Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human-computer interaction. In essence, VC involves understanding a video and describing it with language. Captioning is used in a host of applications from creating more accessible interfaces (e.g., low-vision navigatio…

    Submitted 22 April, 2023; originally announced April 2023.

    Comments: 42 pages, 10 figures

  45. arXiv:2304.05669  [pdf, other]

    cs.CV cs.GR

    Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation

    Authors: Liwen Wu, Rui Zhu, Mustafa B. Yaldiz, Yinhao Zhu, Hong Cai, Janarbek Matai, Fatih Porikli, Tzu-Mao Li, Manmohan Chandraker, Ravi Ramamoorthi

    Abstract: Inverse path tracing has recently been applied to joint material and lighting estimation, given geometry and multi-view HDR observations of an indoor scene. However, it has two major limitations: path tracing is expensive to compute, and ambiguities exist between reflection and emission. Our Factorized Inverse Path Tracing (FIPT) addresses these challenges by using a factored light transport formu…

    Submitted 23 August, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

    Comments: Updated experiment results; modified real-world sections

  46. arXiv:2304.03369  [pdf, other]

    cs.CV

    EGA-Depth: Efficient Guided Attention for Self-Supervised Multi-Camera Depth Estimation

    Authors: Yunxiao Shi, Hong Cai, Amin Ansari, Fatih Porikli

    Abstract: The ubiquitous multi-camera setup on modern autonomous vehicles provides an opportunity to construct surround-view depth. Existing methods, however, either perform independent monocular depth estimations on each camera or rely on computationally heavy self attention mechanisms. In this paper, we propose a novel guided attention architecture, EGA-Depth, which can improve both the efficiency and acc…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: CVPR 2023 Workshop on Autonomous Driving

  47. arXiv:2303.15651  [pdf, other]

    cs.CV

    4D Panoptic Segmentation as Invariant and Equivariant Field Prediction

    Authors: Minghan Zhu, Shizhong Han, Hong Cai, Shubhankar Borse, Maani Ghaffari, Fatih Porikli

    Abstract: In this paper, we develop rotation-equivariant neural networks for 4D panoptic segmentation. 4D panoptic segmentation is a benchmark task for autonomous driving that requires recognizing semantic classes and object instances on the road based on LiDAR scans, as well as assigning temporally consistent IDs to instances across time. We observe that the driving scenario is symmetric to rotations on th…

    Submitted 12 September, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

    Comments: 13 pages. Accepted at ICCV 2023

  48. arXiv:2303.14078  [pdf, other]

    cs.CV

    DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling

    Authors: Jisoo Jeong, Hong Cai, Risheek Garrepalli, Fatih Porikli

    Abstract: We propose a novel data augmentation approach, DistractFlow, for training optical flow estimation models by introducing realistic distractions to the input frames. Based on a mixing ratio, we combine one of the frames in the pair with a distractor image depicting a similar domain, which allows for inducing visual perturbations congruent with natural objects and scenes. We refer to such pairs as di…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  49. arXiv:2303.04336  [pdf, other]

    eess.IV cs.CV cs.LG

    QuickSRNet: Plain Single-Image Super-Resolution Architecture for Faster Inference on Mobile Platforms

    Authors: Guillaume Berger, Manik Dhingra, Antoine Mercier, Yashesh Savani, Sunny Panchal, Fatih Porikli

    Abstract: In this work, we present QuickSRNet, an efficient super-resolution architecture for real-time applications on mobile platforms. Super-resolution clarifies, sharpens, and upscales an image to higher resolution. Applications such as gaming and video playback along with the ever-improving display capabilities of TVs, smartphones, and VR headsets are driving the need for efficient upscaling solutions.…

    Submitted 14 May, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

    Comments: Camera-ready version (CVPR workshop - MAI'23)

  50. arXiv:2303.02203  [pdf, other]

    cs.CV cs.RO

    X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

    Authors: Marvin Klingner, Shubhankar Borse, Varun Ravi Kumar, Behnaz Rezaei, Venkatraman Narayanan, Senthil Yogamani, Fatih Porikli

    Abstract: Recent advances in 3D object detection (3DOD) have obtained remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform due to the necessary view transformation of features from perspective view (PV) to a 3D world representation which is ambiguous due to missing depth information. This paper introduces X$^3$KD, a comprehensi…

    Submitted 3 March, 2023; originally announced March 2023.

    Comments: Accepted to CVPR 2023