-
HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs
Authors:
Nikitha SR,
Aradhya Neeraj Mathur,
Tarun Ram Menta,
Rishabh Jain,
Mausoom Sarkar
Abstract:
The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x savings in FLOPs.
Submitted 21 June, 2025;
originally announced June 2025.
-
MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification
Authors:
Phu Pham,
Aradhya N. Mathur,
Ojaswa Sharma,
Aniket Bera
Abstract:
The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methodologies like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the "Janus" problem (multi-face ambiguities) due to imprecise guidance. Additionally, while recent advancements in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns Gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, offering a substantial efficiency gain over most existing methods, which require hours of training time to achieve comparable results.
Submitted 10 September, 2024;
originally announced September 2024.
-
Curvy: A Parametric Cross-section based Surface Reconstruction
Authors:
Aradhya N. Mathur,
Apoorv Khattar,
Ojaswa Sharma
Abstract:
In this work, we present a novel approach for reconstructing shape point clouds from sparse planar cross-sections with the help of generative modeling. We present unique challenges pertaining to representation and reconstruction in this problem setting. Most methods in the classical literature lack the ability to generalize based on object class and employ complex mathematical machinery to reconstruct reliable surfaces. We present a simple learnable approach to generate a large number of points from a small number of input cross-sections over a large dataset. We use a compact parametric polyline representation with adaptive splitting to represent the cross-sections, and perform learning using a Graph Neural Network to reconstruct the underlying shape in an adaptive manner, reducing the dependence on the number of cross-sections provided.
Submitted 1 September, 2024;
originally announced September 2024.
-
RL Dreams: Policy Gradient Optimization for Score Distillation based 3D Generation
Authors:
Aradhya N. Mathur,
Phu Pham,
Aniket Bera,
Ojaswa Sharma
Abstract:
3D generation has rapidly accelerated in the past decade owing to the progress in the field of generative modeling. Score Distillation Sampling (SDS) based rendering has improved 3D asset generation to a great extent. Further, the recent work on Denoising Diffusion Policy Optimization (DDPO) demonstrates that the diffusion process is compatible with policy gradient methods, and DDPO has been shown to improve 2D diffusion models using an aesthetic scoring function. We first show that this aesthetic scorer acts as a strong guide for a variety of SDS-based methods and demonstrate its effectiveness in text-to-3D synthesis. Further, we leverage the DDPO approach to improve the quality of the 3D rendering obtained from 2D diffusion models. Our approach, DDPO3D, employs the policy gradient method in tandem with aesthetic scoring. To the best of our knowledge, this is the first method that extends policy gradient methods to 3D score-based rendering and shows improvement across SDS-based methods such as DreamGaussian, which are currently driving research in text-to-3D synthesis. Our approach is compatible with score distillation-based methods, which would facilitate the integration of diverse reward functions into the generative process. Our project page can be accessed via https://ddpo3d.github.io.
Submitted 7 December, 2023;
originally announced December 2023.
-
LIFI: Towards Linguistically Informed Frame Interpolation
Authors:
Aradhya Neeraj Mathur,
Devansh Batra,
Yaman Kumar,
Rajiv Ratn Shah,
Roger Zimmermann
Abstract:
In this work, we explore a new problem of frame interpolation for speech videos. Such content today constitutes the major form of online communication. We try to solve this problem by using several deep learning video generation algorithms to generate the missing frames. We also provide examples where computer vision models, despite showing high performance on conventional non-linguistic metrics, fail to produce a faithful interpolation of speech. With this motivation, we provide a new set of linguistically informed metrics specifically targeted at the problem of speech video interpolation. We also release several datasets to test computer vision video generation models on their speech understanding.
Submitted 2 December, 2020; v1 submitted 30 October, 2020;
originally announced October 2020.
-
Multimodal Medical Volume Colorization from 2D Style
Authors:
Aradhya Neeraj Mathur,
Apoorv Khattar,
Ojaswa Sharma
Abstract:
Colorization involves the synthesis of colors on a target image while preserving structural content as well as the semantics of the target image. This is a well-explored problem in 2D with many state-of-the-art solutions. We propose a novel deep learning-based approach for the colorization of 3D medical volumes. Our system is capable of directly mapping the colors of a 2D photograph to a 3D MRI volume in real-time, producing a high-fidelity color volume suitable for photo-realistic visualization. Since this work is the first of its kind, we discuss the full pipeline in detail and the challenges that it brings for 3D medical data. The colorization of medical MRI volumes also entails modality conversion, which highlights the robustness of our approach in handling multi-modal data.
Submitted 6 April, 2020;
originally announced April 2020.