Search | arXiv e-print repository

Embodiment in multimodal large language models

Authors: Akila Kadambi, Lisa Aziz-Zadeh, Antonio Damasio, Marco Iacoboni, Srini Narayanan

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated extraordinary progress in bridging textual and visual inputs. However, MLLMs still face challenges in situated physical and social interactions in sensorially rich, multimodal and real-world settings where the embodied experience of the living organism is essential. We posit that the next frontiers for MLLM development require incorporating both internal and external embodiment -- modeling not only external interactions with the world, but also internal states and drives. Here, we describe mechanisms of internal and external embodiment in humans and relate these to current advances in MLLMs in early stages of aligning to human representations. Our dual-embodied framework proposes to model interactions between these forms of embodiment in MLLMs to bridge the gap between multimodal data and world experience.

Submitted 11 October, 2025; originally announced October 2025.

arXiv:2510.04390 [pdf, ps, other]

MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator

Authors: Xuehai He, Shijie Zhou, Thivyanth Venkateswaran, Kaizhi Zheng, Ziyu Wan, Achuta Kadambi, Xin Eric Wang

Abstract: World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.

Submitted 5 October, 2025; originally announced October 2025.

arXiv:2509.23517 [pdf, ps, other]

Evaluating point-light biological motion in multimodal large language models

Authors: Akila Kadambi, Marco Iacoboni, Lisa Aziz-Zadeh, Srini Narayanan

Abstract: Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to human embodied experience. Since PLDs isolate body motion as the sole source of meaning, they represent key stimuli for testing the constraints of action understanding in these systems. Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from human PLDs. Tested models include state-of-the-art proprietary and open-source systems on single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, indicating fundamental gaps in action and spatiotemporal understanding.

Submitted 27 September, 2025; originally announced September 2025.

arXiv:2508.02095 [pdf, ps, other]

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Authors: Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi

Abstract: Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.

Submitted 6 August, 2025; v1 submitted 4 August, 2025; originally announced August 2025.

Comments: ICCV 2025, Project Website: https://vlm4d.github.io/

arXiv:2503.20776 [pdf, other]

Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Authors: Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi

Abstract: Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from a 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

Submitted 28 March, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

arXiv:2412.10846 [pdf]

Detecting Activities of Daily Living in Egocentric Video to Contextualize Hand Use at Home in Outpatient Neurorehabilitation Settings

Authors: Adesh Kadambi, José Zariffa

Abstract: Wearable egocentric cameras and machine learning have the potential to provide clinicians with a more nuanced understanding of patient hand use at home after stroke and spinal cord injury (SCI). However, they require detailed contextual information (i.e., activities and object interactions) to effectively interpret metrics and meaningfully guide therapy planning. We demonstrate that an object-centric approach, focusing on what objects patients interact with rather than how they move, can effectively recognize Activities of Daily Living (ADL) in real-world rehabilitation settings. We evaluated our models on a complex dataset collected in the wild comprising 2261 minutes of egocentric video from 16 participants with impaired hand function. By leveraging pre-trained object detection and hand-object interaction models, our system achieves robust performance across different impairment levels and environments, with our best model achieving a mean weighted F1-score of 0.78 +/- 0.12 and maintaining an F1-score > 0.5 for all participants using leave-one-subject-out cross validation. Through qualitative analysis, we observe that this approach generates clinically interpretable information about functional object use while being robust to patient-specific movement variations, making it particularly suitable for rehabilitation contexts with prevalent upper limb impairment.

Submitted 14 December, 2024; originally announced December 2024.

Comments: To be submitted to IEEE Transactions on Neural Systems and Rehabilitation Engineering. 11 pages, 3 figures, 2 tables

arXiv:2412.06753 [pdf, other]

InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention

Authors: Howard Zhang, Yuval Alaluf, Sizhuo Ma, Achuta Kadambi, Jian Wang, Kfir Aberman

Abstract: Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.

Submitted 9 December, 2024; originally announced December 2024.

Comments: Project page: https://snap-research.github.io/InstantRestore/

arXiv:2412.00372 [pdf, other]

2-Factor Retrieval for Improved Human-AI Decision Making in Radiology

Authors: Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi

Abstract: Human-machine teaming in medical AI requires us to understand to what degree a trained clinician should weigh AI predictions. While previous work has shown the potential of AI assistance at improving clinical predictions, existing clinical decision support systems either provide no explainability of their predictions or use techniques like saliency and Shapley values, which do not allow for physician-based verification. To address this gap, this study compares previously used explainable AI techniques with a newly proposed technique termed '2-factor retrieval (2FR)', which is a combination of interface design and search retrieval that returns similarly labeled data without processing this data. This results in a 2-factor security blanket where: (a) correct images need to be retrieved by the AI; and (b) humans should associate the retrieved images with the current pathology under test. We find that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician accuracy, with particular improvements when clinicians are radiologists and have low confidence in their decision. Our results highlight the importance of understanding how different modes of human-AI decision making may impact clinician accuracy in clinical decision support systems.

Submitted 30 November, 2024; originally announced December 2024.

arXiv:2410.18956 [pdf, other]

Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Authors: Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang

Abstract: Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.

Submitted 30 October, 2024; v1 submitted 24 October, 2024; originally announced October 2024.

Comments: Project Website: https://largespatialmodel.github.io

arXiv:2407.16902 [pdf, other]

The Potential and Perils of Generative Artificial Intelligence for Quality Improvement and Patient Safety

Authors: Laleh Jalilian, Daniel McDuff, Achuta Kadambi

Abstract: Generative artificial intelligence (GenAI) has the potential to improve healthcare through automation that enhances the quality and safety of patient care. Powered by foundation models that have been pretrained and can generate complex content, GenAI represents a paradigm shift away from the more traditional focus on task-specific classifiers that have dominated the AI landscape thus far. We posit that the imminent application of GenAI in healthcare will be through well-defined, low risk, high value, and narrow applications that automate healthcare workflows at the point of care using smaller foundation models. These models will be finetuned for different capabilities and application specific scenarios and will have the ability to provide medical explanations, reference evidence within a retrieval augmented framework, and utilize external tools. We contrast this with a general, all-purpose AI model for end-to-end clinical decision making that improves clinician performance, including safety-critical diagnostic tasks, which will require greater research prior to implementation. We consider areas where 'human in the loop' Generative AI can improve healthcare quality and safety by automating mundane tasks. Using the principles of implementation science will be critical for integrating 'end to end' GenAI systems that will be accepted by healthcare teams.

Submitted 23 June, 2024; originally announced July 2024.

arXiv:2407.11936 [pdf, other]

Thermal Imaging and Radar for Remote Sleep Monitoring of Breathing and Apnea

Authors: Kai Del Regno, Alexander Vilesov, Adnan Armouti, Anirudh Bindiganavale Harish, Selim Emir Can, Ashley Kita, Achuta Kadambi

Abstract: Polysomnography (PSG), the current gold standard method for monitoring and detecting sleep disorders, is cumbersome and costly. At-home testing solutions, known as home sleep apnea testing (HSAT), exist. However, they are contact-based, a feature which limits the ability of some patient populations to tolerate testing and discourages widespread deployment. Previous work on non-contact sleep monitoring for sleep apnea detection either estimates respiratory effort using radar or nasal airflow using a thermal camera, but has not compared the two or used them together. We conducted a study on 10 participants, ages 34 - 78, with suspected sleep disorders using a hardware setup with a synchronized radar and thermal camera. We show the first comparison of radar and thermal imaging for sleep monitoring, and find that our thermal imaging method outperforms radar significantly. Our thermal imaging method detects apneas with an accuracy of 0.99, a precision of 0.68, a recall of 0.74, an F1 score of 0.71, and an intra-class correlation of 0.70; our radar method detects apneas with an accuracy of 0.83, a precision of 0.13, a recall of 0.86, an F1 score of 0.22, and an intra-class correlation of 0.13. We also present a novel proposal for classifying obstructive and central sleep apnea by leveraging a multimodal setup. This method could be used to accurately detect and classify apneas during sleep with non-contact sensors, thereby improving diagnostic capacities in patient populations unable to tolerate current technology.

Submitted 7 August, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.04169 [pdf, other]

Solutions to Deepfakes: Can Camera Hardware, Cryptography, and Deep Learning Verify Real Images?

Authors: Alexander Vilesov, Yuan Tian, Nader Sehatbakhsh, Achuta Kadambi

Abstract: The exponential progress in generative AI poses serious implications for the credibility of all real images and videos. There will exist a point in the future where 1) digital content produced by generative AI will be indistinguishable from those created by cameras, 2) high-quality generative algorithms will be accessible to anyone, and 3) the ratio of all synthetic to real images will be large. It is imperative to establish methods that can separate real data from synthetic data with high confidence. We define real images as those that were produced by the camera hardware, capturing a real-world scene. Any synthetic generation of an image or alteration of a real image through generative AI or computer graphics techniques is labeled as a synthetic image. To this end, this document aims to: present known strategies in detection and cryptography that can be employed to verify which images are real, weigh the strengths and weaknesses of these strategies, and suggest additional improvements to alleviate shortcomings.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.13527 [pdf, other]

4K4DGen: Panoramic 4D Generation at 4K Resolution

Authors: Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, Zhiwen Fan

Abstract: The blooming of virtual reality and augmented reality (VR/AR) technologies has driven an increasing demand for the creation of high-quality, immersive, and dynamic environments. However, existing generative techniques either focus solely on dynamic objects or perform outpainting from a single perspective image, failing to meet the requirements of VR/AR applications that need free-viewpoint, 360$^{\circ}$ virtual views where users can move in all directions. In this work, we tackle the challenging task of elevating a single panorama to an immersive 4D experience. For the first time, we demonstrate the capability to generate omnidirectional dynamic scenes with 360$^{\circ}$ views at 4K (4096 $\times$ 2048) resolution, thereby providing an immersive user experience. Our method introduces a pipeline that facilitates natural scene animations and optimizes a set of dynamic Gaussians using efficient splatting techniques for real-time exploration. To overcome the lack of scene-scale annotated 4D data and models, especially in panoramic formats, we propose a novel Panoramic Denoiser that adapts generic 2D diffusion priors to animate consistently in 360$^{\circ}$ images, transforming them into panoramic videos with dynamic scenes at targeted regions. Subsequently, we propose Dynamic Panoramic Lifting to elevate the panoramic video into a 4D immersive environment while preserving spatial and temporal consistency. By transferring prior knowledge from 2D models in the perspective domain to the panoramic domain and the 4D lifting with spatial appearance and geometry regularization, we achieve high-quality Panorama-to-4D generation at a resolution of 4K for the first time.

Submitted 3 October, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

arXiv:2405.17315 [pdf, other]

All-day Depth Completion

Authors: Vadim Ezhov, Hyoungseob Park, Zhaoyang Zhang, Rishi Upadhyay, Howard Zhang, Chethan Chinder Chandrappa, Achuta Kadambi, Yunhao Ba, Julie Dorsey, Alex Wong

Abstract: We propose a method for depth estimation under different illumination conditions, i.e., day and night time. As photometry is uninformative in regions under low-illumination, we tackle the problem through a multi-sensor fusion approach, where we take as input an additional synchronized sparse point cloud (i.e., from a LiDAR) projected onto the image plane as a sparse depth map, along with a camera image. The crux of our method lies in the use of the abundantly available synthetic data to first approximate the 3D scene structure by learning a mapping from sparse to (coarse) dense depth maps along with their predictive uncertainty - we term this, SpaDe. In poorly illuminated regions where photometric intensities do not afford the inference of local shape, the coarse approximation of scene depth serves as a prior; the uncertainty map is then used with the image to guide refinement through an uncertainty-driven residual learning (URL) scheme. The resulting depth completion network leverages complementary strengths from both modalities - depth is sparse but insensitive to illumination and in metric scale, and image is dense but sensitive with scale ambiguity. SpaDe can be used in a plug-and-play fashion, which allows for 25% improvement when augmented onto existing methods to preprocess sparse depth. We demonstrate URL on the nuScenes dataset where we improve over all baselines by an average 11.65% in all-day scenarios, 11.23% when tested specifically for daytime, and 13.12% for nighttime scenes.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 8 pages, 4 figures

arXiv:2404.06903 [pdf, other]

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Authors: Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta Kadambi

Abstract: The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: http://dreamscene360.github.io/

Submitted 25 July, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2403.14874 [pdf, other]

WeatherProof: Leveraging Language Guidance for Semantic Segmentation in Adverse Weather

Authors: Blake Gella, Howard Zhang, Rishi Upadhyay, Tiffany Chang, Nathan Wei, Matthew Waliman, Yunhao Ba, Celso de Melo, Alex Wong, Achuta Kadambi

Abstract: We propose a method to infer semantic segmentation maps from images captured under adverse weather conditions. We begin by examining existing models on images degraded by weather conditions such as rain, fog, or snow, and find that they exhibit a large performance drop as compared to those captured under clear weather. To control for changes in scene structures, we propose WeatherProof, the first semantic segmentation dataset with accurate clear and adverse weather image pairs that share an underlying scene. Through this dataset, we analyze the error modes in existing models and find that they are sensitive to the highly complex combination of different weather effects induced on the image during capture. To improve robustness, we propose a way to use language as guidance by identifying contributions of adverse weather conditions and injecting that as "side information". Models trained using our language guidance exhibit performance gains by up to 10.2% in mIoU on WeatherProof, up to 8.44% in mIoU on the widely used ACDC dataset compared to standard training techniques, and up to 6.21% in mIoU on the ACDC dataset as compared to previous SOTA methods.

Submitted 7 May, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2312.09534

arXiv:2403.12327 [pdf, other]

GT-Rain Single Image Deraining Challenge Report

Authors: Howard Zhang, Yunhao Ba, Ethan Yang, Rishi Upadhyay, Alex Wong, Achuta Kadambi, Yun Guo, Xueyao Xiao, Xiaoxiong Wang, Yi Li, Yi Chang, Luxin Yan, Chaochao Zheng, Luping Wang, Bin Liu, Sunder Ali Khowaja, Jiseok Yoon, Ik-Hyun Lee, Zhao Zhang, Yanyan Wei, Jiahuan Ren, Suiyi Zhao, Huan Zheng

Abstract: This report reviews the results of the GT-Rain challenge on single image deraining at the UG2+ workshop at CVPR 2023. The aim of this competition is to study the rainy weather phenomenon in real world scenarios, provide a novel real world rainy image dataset, and to spark innovative ideas that will further the development of single image deraining methods on real images. Submissions were trained on the GT-Rain dataset and evaluated on an extension of the dataset consisting of 15 additional scenes. Scenes in GT-Rain are comprised of real rainy image and ground truth image captured moments after the rain had stopped. 275 participants were registered in the challenge and 55 competed in the final testing phase.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2312.17234 [pdf, other]

Personalized Restoration via Dual-Pivot Tuning

Authors: Pradyumna Chari, Sizhuo Ma, Daniil Ostashev, Achuta Kadambi, Gurunandan Krishnan, Jian Wang, Kfir Aberman

Abstract: Generative diffusion models can serve as a prior which ensures that solutions of image restoration systems adhere to the manifold of natural images. However, for restoring facial images, a personalized prior is necessary to accurately represent and reconstruct unique facial features of a given individual. In this paper, we propose a simple, yet effective, method for personalized restoration, called Dual-Pivot Tuning - a two-stage approach that personalizes a blind restoration system while maintaining the integrity of the general prior and the distinct role of each component. Our key observation is that for optimal personalization, the generative model should be tuned around a fixed text pivot, while the guiding network should be tuned in a generic (non-personalized) manner, using the personalized generative model as a fixed "pivot". This approach ensures that personalization does not interfere with the restoration process, resulting in a natural appearance with high fidelity to the person's identity and the attributes of the degraded image. We evaluated our approach both qualitatively and quantitatively through extensive experiments with images of widely recognized individuals, comparing it against relevant baselines. Surprisingly, we found that our personalized prior not only achieves higher fidelity to identity with respect to the person's identity, but also outperforms state-of-the-art generic priors in terms of general image quality. Project webpage: https://personalized-restoration.github.io

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.09534 [pdf, other]

WeatherProof: A Paired-Dataset Approach to Semantic Segmentation in Adverse Weather

Authors: Blake Gella, Howard Zhang, Rishi Upadhyay, Tiffany Chang, Matthew Waliman, Yunhao Ba, Alex Wong, Achuta Kadambi

Abstract: The introduction of large, foundational models to computer vision has led to drastically improved performance on the task of semantic segmentation. However, these existing methods exhibit a large performance drop when testing on images degraded by weather conditions such as rain, fog, or snow. We introduce a general paired-training method that can be applied to all current foundational model architectures that leads to improved performance on images in adverse weather conditions. To this end, we create the WeatherProof Dataset, the first semantic segmentation dataset with accurate clear and adverse weather image pairs, which not only enables our new training paradigm, but also improves the evaluation of the performance gap between clear and degraded segmentation. We find that training on these paired clear and adverse weather frames which share an underlying scene results in improved performance on adverse weather data. With this knowledge, we propose a training pipeline which accentuates the advantages of paired-data training using consistency losses and language guidance, which leads to performance improvements by up to 18.4% as compared to standard training procedures.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.04875 [pdf, other]

MVDD: Multi-View Depth Diffusion Models

Authors: Zhen Wang, Qiangeng Xu, Feitong Tan, Menglei Chai, Shichen Liu, Rohit Pandey, Sean Fanello, Achuta Kadambi, Yinda Zhang

Abstract: Denoising diffusion models have demonstrated outstanding results in 2D image generation, yet it remains a challenge to replicate its success in 3D shape generation. In this paper, we propose leveraging multi-view depth, which represents complex 3D shapes in a 2D data format that is easy to denoise. We pair this representation with a diffusion model, MVDD, that is capable of generating high-quality dense point clouds with 20K+ points with fine-grained details. To enforce 3D consistency in multi-view depth, we introduce an epipolar line segment attention that conditions the denoising step for a view on its neighboring views. Additionally, a depth fusion module is incorporated into diffusion steps to further ensure the alignment of depth maps. When augmented with surface reconstruction, MVDD can also produce high-quality 3D meshes. Furthermore, MVDD stands out in other tasks such as depth completion, and can serve as a 3D prior, significantly boosting many downstream tasks, such as GAN inversion. State-of-the-art results from extensive experiments demonstrate MVDD's excellent ability in 3D shape generation, depth completion, and its potential as a 3D prior for downstream tasks.

Submitted 19 December, 2023; v1 submitted 8 December, 2023; originally announced December 2023.

arXiv:2312.03203 [pdf, other]

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

Authors: Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, Achuta Kadambi

Abstract: 3D scene representations have gained immense popularity in recent years. Methods that use Neural Radiance fields are versatile for traditional tasks such as novel view synthesis. In recent times, some work has emerged that aims to extend the functionality of NeRF beyond view synthesis, for semantically aware tasks such as editing and segmentation using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts reducing feature quality. Recently, 3D Gaussian Splatting has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework encounters significant challenges, notably the disparities in spatial resolution and channel consistency between RGB images and feature maps. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel view semantic segmentation, language-guided editing and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method is able to provide comparable or better results, while being significantly faster to both train and render. Additionally, to the best of our knowledge, we are the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/

Submitted 8 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

arXiv:2312.00944 [pdf, other]

Enhancing Diffusion Models with 3D Perspective Geometry Constraints

Authors: Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, Achuta Kadambi

Abstract: While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.

Submitted 1 December, 2023; originally announced December 2023.

Comments: Project Webpage: http://visual.ee.ucla.edu/diffusionperspective.htm/

arXiv:2312.00206 [pdf, other]

SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting

Authors: Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, Achuta Kadambi

Abstract: 3D Gaussian Splatting (3DGS) has recently enabled real-time rendering of unbounded 3D scenes for novel view synthesis. However, this technique requires dense training views to accurately reconstruct 3D geometry. A limited number of input views will significantly degrade reconstruction quality, resulting in artifacts such as "floaters" and "background collapse" at unseen viewpoints. In this work, we introduce SparseGS, an efficient training pipeline designed to address the limitations of 3DGS in scenarios with sparse training views. SparseGS incorporates depth priors, novel depth rendering techniques, and a pruning heuristic to mitigate floater artifacts, alongside an Unseen Viewpoint Regularization module to alleviate background collapses. Our extensive evaluations on the Mip-NeRF360, LLFF, and DTU datasets demonstrate that SparseGS achieves high-quality reconstruction in both unbounded and forward-facing scenarios, with as few as 12 and 3 input images, respectively, while maintaining fast training and real-time rendering capabilities.

Submitted 26 March, 2025; v1 submitted 30 November, 2023; originally announced December 2023.

Comments: Version accepted to 3DV 2025. Project page: https://github.com/ForMyCat/SparseGS

arXiv:2311.17907 [pdf, other]

CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting

Authors: Alexander Vilesov, Pradyumna Chari, Achuta Kadambi

Abstract: With the onset of diffusion-based generative models and their ability to generate text-conditioned images, content generation has received a massive invigoration. Recently, these models have been shown to provide useful guidance for the generation of 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) inability to generate detailed, multi-object scenes, (ii) inability to textually control multi-object configurations, and (iii) physically realistic scene composition. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, possess the capability to enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state of the art results, capable of even exceeding the guiding diffusion model in terms of object combinations and physics accuracy.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2304.08832 [pdf, ps, other]

Improving Infrared Thermography after Solar Loading

Authors: Ellin Q. Zhao, Alexander Vilesov, Pradyumna Chari, Laleh Jalilian, Achuta Kadambi

Abstract: Widely deployed for fever screening, infrared thermometers (IRTs) enable rapid non-contact detection of body temperature, but they are inaccurate in unconstrained environments. Previous works have studied the impact of transient skin temperature on IRTs, but no studies have quantified the effect of skin temperature elevation due to absorbed solar radiation, which we call solar loading. Solar loading leads to poor specificity in fever detection and is a skin tone-dependent effect, introducing inequity in IRTs. The current solution to solar loading is to have a subject reacclimate for up to 30 minutes before IRT measurement. We propose a machine learning method to improve IR thermography by removing the solar loading effect from thermal images of the face. This correction only uses a single frame of thermal data, allowing sub-second correction of skin temperature. On average, forehead skin temperature increases by 2°C after solar loading, and our machine learning model, SL-Net, not only reduces this error by 68% to 0.64°C, but also removes the positive correlation between solar loading error and melanin concentration. We open source a diverse dataset of 100 subjects with co-registered RGB-thermal images, and IRT and skin tone measurements. Our work shows that it is possible to use machine learning to correct complex thermal perturbations and enable robust and equitable human thermography.

Submitted 19 August, 2025; v1 submitted 18 April, 2023; originally announced April 2023.

arXiv:2304.03243 [pdf, other]

Synthetic Data in Healthcare

Authors: Daniel McDuff, Theodore Curran, Achuta Kadambi

Abstract: Synthetic data are becoming a critical tool for building artificially intelligent systems. Simulators provide a way of generating data systematically and at scale. These data can then be used either exclusively, or in conjunction with real data, for training and testing systems. Synthetic data are particularly attractive in cases where the availability of ``real'' training examples might be a bottleneck. While the volume of data in healthcare is growing exponentially, creating datasets for novel tasks and/or that reflect a diverse set of conditions and causal relationships is not trivial. Furthermore, these data are highly sensitive and often patient specific. Recent research has begun to illustrate the potential for synthetic data in many areas of medicine, but no systematic review of the literature exists. In this paper, we present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine. We discuss that while synthetics can promote privacy, equity, safety and continual and causal learning, they also run the risk of introducing flaws, blind spots and propagating or exaggerating biases.

Submitted 6 April, 2023; originally announced April 2023.

arXiv:2212.04096 [pdf, other]

ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction

Authors: Zhen Wang, Shijie Zhou, Jeong Joon Park, Despoina Paschalidou, Suya You, Gordon Wetzstein, Leonidas Guibas, Achuta Kadambi

Abstract: This work introduces alternating latent topologies (ALTO) for high-fidelity reconstruction of implicit 3D surfaces from noisy point clouds. Previous work identifies that the spatial arrangement of latent encodings is important to recover detail. One school of thought is to encode a latent vector for each point (point latents). Another school of thought is to project point latents into a grid (grid latents) which could be a voxel grid or triplane grid. Each school of thought has tradeoffs. Grid latents are coarse and lose high-frequency detail. In contrast, point latents preserve detail. However, point latents are more difficult to decode into a surface, and quality and runtime suffer. In this paper, we propose ALTO to sequentially alternate between geometric representations, before converging to an easy-to-decode latent. We find that this preserves spatial expressiveness and makes decoding lightweight. We validate ALTO on implicit 3D recovery and observe not only a performance improvement over the state-of-the-art, but a runtime improvement of 3-10$\times$. Project website at https://visual.ee.ucla.edu/alto.htm/.

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2209.00746 [pdf, other]

MIME: Minority Inclusion for Majority Group Enhancement of AI Performance

Authors: Pradyumna Chari, Yunhao Ba, Shreeram Athreya, Achuta Kadambi

Abstract: Several papers have rightly included minority groups in artificial intelligence (AI) training data to improve test inference for minority groups and/or society-at-large. A society-at-large consists of both minority and majority stakeholders. A common misconception is that minority inclusion does not increase performance for majority groups alone. In this paper, we make the surprising finding that including minority samples can improve test error for the majority group. In other words, minority group inclusion leads to majority group enhancements (MIME) in performance. A theoretical existence proof of the MIME effect is presented and found to be consistent with experimental results on six different datasets. Project webpage: https://visual.ee.ucla.edu/mime.htm/

Submitted 1 September, 2022; originally announced September 2022.

arXiv:2206.10779 [pdf, other]

Not Just Streaks: Towards Ground Truth for Single Image Deraining

Authors: Yunhao Ba, Howard Zhang, Ethan Yang, Akira Suzuki, Arnold Pfahnl, Chethan Chinder Chandrappa, Celso de Melo, Suya You, Stefano Soatto, Alex Wong, Achuta Kadambi

Abstract: We propose a large-scale dataset of real-world rainy and clean image pairs and a method to remove degradations, induced by rain streaks and rain accumulation, from the image. As there exists no real-world dataset for deraining, current state-of-the-art methods rely on synthetic data and thus are limited by the sim2real domain gap; moreover, rigorous evaluation remains a challenge due to the absence of a real paired dataset. We fill this gap by collecting a real paired deraining dataset through meticulous control of non-rain variations. Our dataset enables paired training and quantitative evaluation for diverse real-world rain phenomena (e.g. rain streaks and rain accumulation). To learn a representation robust to rain phenomena, we propose a deep neural network that reconstructs the underlying scene by minimizing a rain-robust loss between rainy and clean images. Extensive experiments demonstrate that our model outperforms the state-of-the-art deraining methods on real rainy images under various conditions. Project website: https://visual.ee.ucla.edu/gt_rain.htm/.

Submitted 29 July, 2024; v1 submitted 21 June, 2022; originally announced June 2022.

arXiv:2109.13488 [pdf, other]

Towards Rotation Invariance in Object Detection

Authors: Agastya Kalra, Guy Stoppi, Bradley Brown, Rishav Agarwal, Achuta Kadambi

Abstract: Rotation augmentations generally improve a model's invariance/equivariance to rotation - except in object detection. In object detection the shape is not known, therefore rotation creates a label ambiguity. We show that the de-facto method for bounding box label rotation, the Largest Box Method, creates very large labels, leading to poor performance and in many cases worse performance than using no rotation at all. We propose a new method of rotation augmentation that can be implemented in a few lines of code. First, we create a differentiable approximation of label accuracy and show that axis-aligning the bounding box around an ellipse is optimal. We then introduce Rotation Uncertainty (RU) Loss, allowing the model to adapt to the uncertainty of the labels. On five different datasets (including COCO, PascalVOC, and Transparent Object Bin Picking), this approach improves the rotational invariance of both one-stage and two-stage architectures when measured with AP, AP50, and AP75. The code is available at https://github.com/akasha-imaging/ICCV2021.

Submitted 30 September, 2021; v1 submitted 28 September, 2021; originally announced September 2021.

Comments: Accepted ICCV 2021

arXiv:2109.05959 [pdf]

Physics-AI Symbiosis

Authors: Bahram Jalali, Achuta Kadambi, Vwani Roychowdhury

Abstract: The phenomenal success of physics in explaining nature and designing hardware is predicated on efficient computational models. A universal codebook of physical laws defines the computational rules and a physical system is an interacting ensemble governed by these rules. Led by deep neural networks, artificial intelligence (AI) has introduced an alternate end-to-end data-driven computational framework, with astonishing performance gains in image classification and speech recognition and fueling hopes for a novel approach to discovering physics itself. These gains, however, come at the expense of interpretability and also computational efficiency; a trend that is on a collision course with the expected end of semiconductor scaling known as Moore's Law. With focus on photonic applications, this paper argues how an emerging symbiosis of physics and artificial intelligence can overcome such formidable challenges, thereby not only extending the latter's spectacular rise but also transforming the direction of physical science.

Submitted 10 September, 2021; originally announced September 2021.

arXiv:2106.06007 [pdf, other]

Overcoming Difficulty in Obtaining Dark-skinned Subjects for Remote-PPG by Synthetic Augmentation

Authors: Yunhao Ba, Zhen Wang, Kerim Doruk Karinca, Oyku Deniz Bozkurt, Achuta Kadambi

Abstract: Camera-based remote photoplethysmography (rPPG) provides a non-contact way to measure physiological signals (e.g., heart rate) using facial videos. Recent deep learning architectures have improved the accuracy of such physiological measurement significantly, yet they are restricted by the diversity of the annotated videos. The existing datasets MMSE-HR, AFRL, and UBFC-RPPG contain roughly 10%, 0%, and 5% of dark-skinned subjects respectively. The unbalanced training sets result in a poor generalization capability to unseen subjects and lead to unwanted bias toward different demographic groups. In Western academia, it is regrettably difficult in a university setting to collect data on these dark-skinned subjects. Here we show a first attempt to overcome the lack of dark-skinned subjects by synthetic augmentation. A joint optimization framework is utilized to translate real videos from light-skinned subjects to dark skin tones while retaining their pulsatile signals. In the experiment, our method exhibits around 31% reduction in mean absolute error for the dark-skinned group and 46% improvement on bias mitigation for all the groups, as compared with the previous work trained with just real samples.

Submitted 10 June, 2021; originally announced June 2021.

arXiv:2010.12769 [pdf]

Diverse R-PPG: Camera-Based Heart Rate Estimation for Diverse Subject Skin-Tones and Scenes

Authors: Pradyumna Chari, Krish Kabra, Doruk Karinca, Soumyarup Lahiri, Diplav Srivastava, Kimaya Kulkarni, Tianyuan Chen, Maxime Cannesson, Laleh Jalilian, Achuta Kadambi

Abstract: Heart rate (HR) is an essential clinical measure for the assessment of cardiorespiratory instability. Since communities of color are disproportionately affected by both COVID-19 and cardiovascular disease, there is a pressing need to deploy contactless HR sensing solutions for high-quality telemedicine evaluations. Existing computer vision methods that estimate HR from facial videos exhibit biased performance against dark skin tones. We present a novel physics-driven algorithm that boosts performance on darker skin tones in our reported data. We assess the performance of our method through the creation of the first telemedicine-focused remote vital signs dataset, the VITAL dataset. 432 videos (~864 minutes) of 54 subjects with diverse skin tones are recorded under realistic scene conditions with corresponding vital sign data. Our method reduces errors due to lighting changes, shadows, and specular highlights and imparts unbiased performance gains across skin tones, setting the stage for making medically inclusive non-contact HR sensing technologies a viable reality for patients of all skin tones.

Submitted 9 December, 2020; v1 submitted 24 October, 2020; originally announced October 2020.

Comments: 49 pages, 6 figures, 3 tables, Supplement with 7 figures

arXiv:1911.12906 [pdf]

Enhancing Passive Non-Line-of-Sight Imaging Using Polarization Cues

Authors: Kenichiro Tanaka, Yasuhiro Mukaigawa, Achuta Kadambi

Abstract: This paper presents a method of passive non-line-of-sight (NLOS) imaging using polarization cues. A key observation is that oblique light has a different polarimetric signal. It turns out this effect is due to the polarization axis rotation, a phenomenon which can be used to better condition the light transport matrix for non-line-of-sight imaging. Our analysis and results show that the use of a polarizer in front of the camera is not only a separate technique, but it can be seen as an enhancement technique for more advanced forms of passive NLOS imaging. For example, this paper shows that polarization can enhance passive NLOS imaging both with and without occluders. In all tested cases, despite the light attenuation from polarization optics, recovery of the occluded images is improved.

Submitted 28 November, 2019; originally announced November 2019.

arXiv:1911.11893 [pdf, other]

Visual Physics: Discovering Physical Laws from Videos

Authors: Pradyumna Chari, Chinmay Talegaonkar, Yunhao Ba, Achuta Kadambi

Abstract: In this paper, we teach a machine to discover the laws of physics from video streams. We assume no prior knowledge of physics, beyond a temporal stream of bounding boxes. The problem is very difficult because a machine must learn not only a governing equation (e.g. projectile motion) but also the existence of governing parameters (e.g. velocities). We evaluate our ability to discover physical laws on videos of elementary physical phenomena, such as projectile motion or circular motion. These elementary tasks have textbook governing equations and enable ground truth verification of our approach.

Submitted 26 November, 2019; originally announced November 2019.

arXiv:1910.00201 [pdf, other]

Blending Diverse Physical Priors with Neural Networks

Authors: Yunhao Ba, Guangyuan Zhao, Achuta Kadambi

Abstract: Machine learning in the context of physical systems merits a re-examination of the learning strategy. In addition to data, one can leverage a vast library of physical prior models (e.g. kinematics, fluid flow, etc.) to perform more robust inference. The nascent sub-field of physics-based learning (PBL) studies the blending of neural networks with physical priors. While previous PBL algorithms have been applied successfully to specific tasks, it is hard to generalize existing PBL methods to a wide range of physics-based problems. Such generalization would require an architecture that can adapt to variations in the correctness of the physics, or in the quality of training data. No such architecture exists. In this paper, we aim to generalize PBL, by making a first attempt to bring neural architecture search (NAS) to the realm of PBL. We introduce a new method known as physics-based neural architecture search (PhysicsNAS) that is a top-performer across a diverse range of quality in the physical model and the dataset.

Submitted 1 October, 2019; originally announced October 2019.

arXiv:1903.10210 [pdf, other]

Deep Shape from Polarization

Authors: Yunhao Ba, Alex Ross Gilbert, Franklin Wang, Jinfa Yang, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, Achuta Kadambi

Abstract: This paper makes a first attempt to bring the Shape from Polarization (SfP) problem to the realm of deep learning. The previous state-of-the-art methods for SfP have been purely physics-based. We see value in these principled models, and blend these physical models as priors into a neural network architecture. This proposed approach achieves results that exceed the previous state-of-the-art on a challenging dataset we introduce. This dataset consists of polarization images taken over a range of object textures, paints, and lighting conditions. We report that our proposed method achieves the lowest test error on each tested condition in our dataset, showing the value of blending data-driven and physics-driven approaches.

Submitted 25 May, 2020; v1 submitted 25 March, 2019; originally announced March 2019.

arXiv:1605.02066 [pdf, other]

Shape from Mixed Polarization

Authors: Vage Taamazyan, Achuta Kadambi, Ramesh Raskar

Abstract: Shape from Polarization (SfP) estimates surface normals using photos captured at different polarizer rotations. Fundamentally, the SfP model assumes that light is reflected either diffusely or specularly. However, this model is not valid for many real-world surfaces exhibiting a mixture of diffuse and specular properties. To address this challenge, previous methods have used a sequential solution: first, use an existing algorithm to separate the scene into diffuse and specular components, then apply the appropriate SfP model. In this paper, we propose a new method that jointly uses viewpoint and polarization data to holistically separate diffuse and specular components, recover refractive index, and ultimately recover 3D shape. By involving the physics of polarization in the separation process, we demonstrate competitive results with a benchmark method, while recovering additional information (e.g. refractive index).

Submitted 11 June, 2016; v1 submitted 5 May, 2016; originally announced May 2016.

Comments: 13 pages, 5 figures

arXiv:1503.01804 [pdf, other]

Frequency Domain TOF: Encoding Object Depth in Modulation Frequency

Authors: Achuta Kadambi, Vage Taamazyan, Suren Jayasuriya, Ramesh Raskar

Abstract: Time of flight cameras may emerge as the 3-D sensor of choice. Today, time of flight sensors use phase-based sampling, where the phase delay between emitted and received high-frequency signals encodes distance. In this paper, we present a new time of flight architecture that relies only on frequency; we refer to this technique as frequency-domain time of flight (FD-TOF). Inspired by optical coherence tomography (OCT), FD-TOF excels when frequency bandwidth is high. With the increasing frequency of TOF sensors, new challenges to time of flight sensing continue to emerge. At high frequencies, FD-TOF offers several potential benefits over phase-based time of flight methods.

Submitted 5 March, 2015; originally announced March 2015.

Comments: 10 pages

arXiv:1501.04878

A Light Transport Model for Mitigating Multipath Interference in TOF Sensors

Authors: Nikhil Naik, Achuta Kadambi, Christoph Rhemann, Shahram Izadi, Ramesh Raskar, Sing Bing Kang

Abstract: Continuous-wave time-of-flight (TOF) range imaging has become a commercially viable technology with many applications in computer vision and graphics. However, the depth images obtained from TOF cameras contain scene-dependent errors due to multipath interference (MPI). Specifically, MPI occurs when multiple optical reflections return to a single spatial location on the imaging sensor. Many prior approaches to rectifying MPI rely on sparsity in optical reflections, which is an extreme simplification. In this paper, we correct MPI by combining the standard measurements from a TOF camera with information from direct and global light transport. We report results on both simulated experiments and physical experiments (using the Kinect sensor). Our results, evaluated against ground truth, demonstrate a quantitative improvement in depth accuracy.

Submitted 30 January, 2015; v1 submitted 20 January, 2015; originally announced January 2015.

Comments: This paper has been withdrawn by the submitter as the submission was made due to a miscommunication

arXiv:1404.1116 [pdf, other]

doi 10.1364/OL.39.001705

Resolving Multi-path Interference in Time-of-Flight Imaging via Modulation Frequency Diversity and Sparse Regularization

Authors: Ayush Bhandari, Achuta Kadambi, Refael Whyte, Christopher Barsi, Micha Feigin, Adrian Dorrington, Ramesh Raskar

Abstract: Time-of-flight (ToF) cameras calculate depth maps by reconstructing phase shifts of amplitude-modulated signals. For broad illumination or transparent objects, reflections from multiple scene points can illuminate a given pixel, giving rise to an erroneous depth map. We report here a sparsity regularized solution that separates K-interfering components using multiple modulation frequency measurements. The method maps ToF imaging to the general framework of spectral estimation theory and has applications in improving depth profiles and exploiting multiple scattering.

Submitted 3 April, 2014; originally announced April 2014.

Comments: 11 pages, 4 figures, appeared with minor changes in Optics Letters

Journal ref: Optics Letters, Vol. 39, Issue 6, pp. 1705-1708 (2014)

Showing 1–41 of 41 results for author: Kadambi, A