-
Fit Pixels, Get Labels: Meta-learned Implicit Networks for Image Segmentation
Authors:
Kushal Vyas,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Implicit neural representations (INRs) have achieved remarkable successes in learning expressive yet compact signal representations. However, they are not naturally amenable to predictive tasks such as segmentation, where they must learn semantic structures over a distribution of signals. In this study, we introduce MetaSeg, a meta-learning framework to train INRs for medical image segmentation. MetaSeg uses an underlying INR that simultaneously predicts per pixel intensity values and class labels. It then uses a meta-learning procedure to find optimal initial parameters for this INR over a training dataset of images and segmentation maps, such that the INR can simply be fine-tuned to fit pixels of an unseen test image, and automatically decode its class labels. We evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice scores comparable to commonly used U-Net models, but with $90\%$ fewer parameters. MetaSeg offers a fresh, scalable alternative to traditional resource-heavy architectures such as U-Nets and vision transformers for medical image segmentation. Our project is available at https://kushalvyas.github.io/metaseg.html .
Submitted 5 October, 2025;
originally announced October 2025.
-
COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics
Authors:
Matt Y. Cheung,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
In clinical applications, the utility of segmentation models is often based on the accuracy of derived downstream metrics such as organ size, rather than by the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework to derive such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model's representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under exchangeability and nestedness assumptions. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to also recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.
Submitted 26 September, 2025;
originally announced September 2025.
-
FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses
Authors:
Hao Liang,
Zhixuan Ge,
Ashish Tiwari,
Soumendu Majee,
G. M. Dilshan Godaliyadda,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (<10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face ``template'' model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). In addition, FastAvatar's novel latent space design supports real-time identity interpolation and attribute editing which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar's combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.
Submitted 25 August, 2025;
originally announced August 2025.
-
Post-Hurricane Debris Segmentation Using Fine-Tuned Foundational Vision Models
Authors:
Kooshan Amini,
Yuhao Liu,
Jamie Ellen Padgett,
Guha Balakrishnan,
Ashok Veeraraghavan
Abstract:
Timely and accurate detection of hurricane debris is critical for effective disaster response and community resilience. While post-disaster aerial imagery is readily available, robust debris segmentation solutions applicable across multiple disaster regions remain limited. Developing a generalized solution is challenging due to varying environmental and imaging conditions that alter debris' visual signatures across different regions, further compounded by the scarcity of training data. This study addresses these challenges by fine-tuning pre-trained foundational vision models, achieving robust performance with a relatively small, high-quality dataset. Specifically, this work introduces an open-source dataset comprising approximately 1,200 manually annotated aerial RGB images from Hurricanes Ian, Ida, and Ike. To mitigate human biases and enhance data quality, labels from multiple annotators are strategically aggregated and visual prompt engineering is employed. The resulting fine-tuned model, named fCLIPSeg, achieves a Dice score of 0.70 on data from Hurricane Ida -- a disaster event entirely excluded during training -- with virtually no false positives in debris-free areas. This work presents the first event-agnostic debris segmentation model requiring only standard RGB imagery during deployment, making it well-suited for rapid, large-scale post-disaster impact assessments and recovery planning.
Submitted 18 April, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
TranSplat: Lighting-Consistent Cross-Scene Object Transfer with 3D Gaussian Splatting
Authors:
Tony Yu,
Yanlin Jin,
Ashok Veeraraghavan,
Akshat Dave,
Guha Balakrishnan
Abstract:
We present TranSplat, a 3D scene rendering algorithm that enables realistic cross-scene object transfer (from a source to a target scene) based on the Gaussian Splatting framework. Our approach addresses two critical challenges: (1) precise 3D object extraction from the source scene, and (2) faithful relighting of the transferred object in the target scene without explicit material property estimation. TranSplat fits a splatting model to the source scene, using 2D object masks to drive fine-grained 3D segmentation. Following user-guided insertion of the object into the target scene, along with automatic refinement of position and orientation, TranSplat derives per-Gaussian radiance transfer functions via spherical harmonic analysis to adapt the object's appearance to match the target scene's lighting environment. This relighting strategy does not require explicitly estimating physical scene properties such as BRDFs. Evaluated on several synthetic and real-world scenes and objects, TranSplat yields excellent 3D object extractions and relighting performance compared to recent baseline methods and visually convincing cross-scene object transfers. We conclude by discussing the limitations of the approach.
Submitted 7 May, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
When are Diffusion Priors Helpful in Sparse Reconstruction? A Study with Sparse-view CT
Authors:
Matt Y. Cheung,
Sophia Zorek,
Tucker J. Netherton,
Laurence E. Court,
Sadeer Al-Kindi,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Diffusion models demonstrate state-of-the-art performance on image generation, and are gaining traction for sparse medical image reconstruction tasks. However, compared to classical reconstruction algorithms relying on simple analytical priors, diffusion models have the dangerous property of producing realistic-looking results \emph{even when incorrect}, particularly with few observations. We investigate the utility of diffusion models as priors for image reconstruction by varying the number of observations and comparing their performance to classical priors (sparse and Tikhonov regularization) using pixel-based, structural, and downstream metrics. We make comparisons on low-dose chest wall computed tomography (CT) for fat mass quantification. First, we find that classical priors are superior to diffusion priors when the number of projections is ``sufficient''. Second, we find that diffusion priors can capture a large amount of detail with very few observations, significantly outperforming classical priors. However, they fall short of capturing all details, even with many observations. Finally, we find that the performance of diffusion priors plateaus after extremely few ($\approx$10-15) projections. Ultimately, our work highlights potential issues with diffusion-based sparse reconstruction and underscores the importance of further investigation, particularly in high-stakes clinical settings.
Submitted 4 February, 2025;
originally announced February 2025.
-
Video-based Surgical Tool-tip and Keypoint Tracking using Multi-frame Context-driven Deep Learning Models
Authors:
Bhargav Ghanekar,
Lianne R. Johnson,
Jacob L. Laughlin,
Marcia K. O'Malley,
Ashok Veeraraghavan
Abstract:
Automated tracking of surgical tool keypoints in robotic surgery videos is an essential task for various downstream use cases such as skill assessment, expertise assessment, and the delineation of safety zones. In recent years, the explosion of deep learning for vision applications has led to many works in surgical instrument segmentation, while less attention has been paid to tracking specific tool keypoints, such as tool tips. In this work, we propose a novel, multi-frame context-driven deep learning framework to localize and track tool keypoints in surgical videos. We train and test our models on the annotated frames from the 2015 EndoVis Challenge dataset, resulting in state-of-the-art performance. By leveraging sophisticated deep learning models and multi-frame context, we achieve 90\% keypoint detection accuracy and a localization RMS error of 5.27 pixels. Results on a self-annotated JIGSAWS dataset with more challenging scenarios also show that the proposed multi-frame models can accurately track tool-tip and tool-base keypoints, with ${<}4.2$-pixel RMS error overall. Such a framework paves the way for accurately tracking surgical instrument keypoints, enabling further downstream use cases. Project and dataset webpage: https://tinyurl.com/mfc-tracker
Submitted 30 January, 2025;
originally announced January 2025.
-
Regression Conformal Prediction under Bias
Authors:
Matt Y. Cheung,
Tucker J. Netherton,
Laurence E. Court,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Uncertainty quantification is crucial to account for the imperfect predictions of machine learning algorithms for high-impact applications. Conformal prediction (CP) is a powerful framework for uncertainty quantification that generates calibrated prediction intervals with valid coverage. In this work, we study how CP intervals are affected by bias - the systematic deviation of a prediction from ground truth values - a phenomenon prevalent in many real-world applications. We investigate the influence of bias on interval lengths of two different types of adjustments -- symmetric adjustments, the conventional method where both sides of the interval are adjusted equally, and asymmetric adjustments, a more flexible method where the interval can be adjusted unequally in positive or negative directions. We present theoretical and empirical analyses characterizing how symmetric and asymmetric adjustments impact the "tightness" of CP intervals for regression tasks. Specifically for absolute residual and quantile-based non-conformity scores, we prove: 1) the upper bound of symmetrically adjusted interval lengths increases by $2|b|$ where $b$ is a globally applied scalar value representing bias, 2) asymmetrically adjusted interval lengths are not affected by bias, and 3) conditions when asymmetrically adjusted interval lengths are guaranteed to be smaller than symmetric ones. Our analyses suggest that even if predictions exhibit significant drift from ground truth values, asymmetrically adjusted intervals are still able to maintain the same tightness and validity of intervals as if the drift had never happened, while symmetric ones significantly inflate the lengths. We demonstrate our theoretical results with two real-world prediction tasks: sparse-view computed tomography (CT) reconstruction and time-series weather forecasting. Our work paves the way for more bias-robust machine learning systems.
Submitted 7 October, 2024;
originally announced October 2024.
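The two bias claims in the abstract above can be checked numerically with a small sketch. This is not the authors' code: it assumes a plain split-conformal setup on synthetic Gaussian residuals, and the helper names (`conformal_quantile`, `symmetric_length`, `asymmetric_length`) are hypothetical. A global bias $b$ added to every prediction shifts each signed residual by $-b$, so the asymmetric interval length should stay the same while the symmetric length should grow by at most $2|b|$.

```python
import math
import random


def conformal_quantile(scores, level):
    """Return the ceil((n + 1) * level)-th smallest score, clipped to the largest."""
    ordered = sorted(scores)
    n = len(ordered)
    k = min(n, math.ceil((n + 1) * level))
    return ordered[k - 1]


def symmetric_length(residuals, alpha):
    """Length of the interval [pred - q, pred + q], q from absolute residuals."""
    q = conformal_quantile([abs(r) for r in residuals], 1 - alpha)
    return 2 * q


def asymmetric_length(residuals, alpha):
    """Length of the interval [pred + lo, pred + hi], lo and hi from signed residuals."""
    hi = conformal_quantile(residuals, 1 - alpha / 2)
    lo = -conformal_quantile([-r for r in residuals], 1 - alpha / 2)
    return hi - lo


# Synthetic calibration residuals (truth minus prediction); purely illustrative.
random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Adding a global bias b to every prediction shifts every residual by -b.
b = 3.0
biased = [r - b for r in residuals]

alpha = 0.1
print("symmetric  (unbiased, biased):",
      symmetric_length(residuals, alpha), symmetric_length(biased, alpha))
print("asymmetric (unbiased, biased):",
      asymmetric_length(residuals, alpha), asymmetric_length(biased, alpha))
```

In this sketch the asymmetric length is unchanged up to floating-point rounding, because both signed quantiles shift by the same amount. The symmetric length inflates, but never by more than $2|b|$, because each $|r_i - b| \le |r_i| + |b|$.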
-
Downscaling Extreme Precipitation with Wasserstein Regularized Diffusion
Authors:
Yuhao Liu,
James Doss-Gollin,
Qiushi Dai,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Understanding the risks posed by extreme rainfall events requires analysis of precipitation fields with high resolution (to assess localized hazards) and extensive historical coverage (to capture sufficient examples of rare occurrences). Radar and mesonet networks provide precipitation fields at 1 km resolution but with limited historical and geographical coverage, while gauge-based records and reanalysis products cover decades of time on a global scale, but only at 30-50 km resolution. To help provide high-resolution precipitation estimates over long time scales, this study presents Wasserstein Regularized Diffusion (WassDiff), a diffusion framework to downscale (super-resolve) precipitation fields from low-resolution gauge and reanalysis products. Crucially, unlike related deep generative models, WassDiff integrates a Wasserstein distribution-matching regularizer to the denoising process to reduce empirical biases at extreme intensities. Comprehensive evaluations demonstrate that WassDiff quantitatively outperforms existing state-of-the-art generative downscaling methods at recovering extreme weather phenomena such as tropical storms and cold fronts. Case studies further qualitatively demonstrate WassDiff's ability to reproduce realistic fine-scale weather structures and accurate peak intensities. By unlocking decades of high-resolution rainfall information from globally available coarse records, WassDiff offers a practical pathway toward more accurate flood-risk assessments and climate-adaptation planning.
Submitted 12 August, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Learning Transferable Features for Implicit Neural Representations
Authors:
Kushal Vyas,
Ahmed Imtiaz Humayun,
Aniket Dashpute,
Richard G. Baraniuk,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Implicit neural representations (INRs) have demonstrated success in a variety of applications, including inverse problems and neural rendering. An INR is typically trained to capture one signal of interest, resulting in learned neural features that are highly attuned to that signal. Although such features are assumed to be less generalizable, we explore their transferability for fitting similar signals. We introduce a new INR training framework, STRAINER, that learns transferable features for fitting INRs to new signals from a given distribution, faster and with better reconstruction quality. Owing to the sequential layer-wise affine operations in an INR, we propose to learn transferable representations by sharing initial encoder layers across multiple INRs with independent decoder layers. At test time, the learned encoder representations are transferred as initialization for an otherwise randomly initialized INR. We find STRAINER to yield extremely powerful initialization for fitting images from the same domain and allow for $\approx +10dB$ gain in signal quality early on compared to an untrained INR itself. STRAINER also provides a simple way to encode data-driven priors in INRs. We evaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks and inverse problems and further provide detailed analysis and discussion on the transferability of STRAINER's features. Our demo can be accessed at https://kushalvyas.github.io/strainer.html .
Submitted 9 January, 2025; v1 submitted 14 September, 2024;
originally announced September 2024.
-
DIFR3CT: Latent Diffusion for Probabilistic 3D CT Reconstruction from Few Planar X-Rays
Authors:
Yiran Sun,
Hana Baroudi,
Tucker Netherton,
Laurence Court,
Osama Mawlawi,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Computed Tomography (CT) scans are the standard-of-care for the visualization and diagnosis of many clinical ailments, and are needed for the treatment planning of external beam radiotherapy. Unfortunately, the availability of CT scanners in low- and mid-resource settings is highly variable. Planar x-ray radiography units, in comparison, are far more prevalent, but can only provide limited 2D observations of the 3D anatomy. In this work, we propose DIFR3CT, a 3D latent diffusion model that can generate a distribution of plausible CT volumes from one or few (<10) planar x-ray observations. DIFR3CT works by fusing 2D features from each x-ray into a joint 3D space, and performing diffusion conditioned on these fused features in a low-dimensional latent space. We conduct extensive experiments demonstrating that DIFR3CT is better than recent sparse CT reconstruction baselines in terms of standard pixel-level metrics (PSNR, SSIM) on both the public LIDC and in-house post-mastectomy CT datasets. We also show that DIFR3CT supports uncertainty quantification via Monte Carlo sampling, which provides an opportunity to measure reconstruction reliability. Finally, we perform a preliminary pilot study evaluating DIFR3CT for automated breast radiotherapy contouring and planning -- and demonstrate promising feasibility. Our code is available at https://github.com/yransun/DIFR3CT.
Submitted 27 August, 2024;
originally announced August 2024.
-
NeST: Neural Stress Tensor Tomography by leveraging 3D Photoelasticity
Authors:
Akshat Dave,
Tianyi Zhang,
Aaron Young,
Ramesh Raskar,
Wolfgang Heidrich,
Ashok Veeraraghavan
Abstract:
Photoelasticity enables full-field stress analysis in transparent objects through stress-induced birefringence. Existing techniques are limited to 2D slices and require destructively slicing the object. Recovering the internal 3D stress distribution of the entire object is challenging as it involves solving a tensor tomography problem and handling phase wrapping ambiguities. We introduce NeST, an analysis-by-synthesis approach for reconstructing 3D stress tensor fields as neural implicit representations from polarization measurements. Our key insight is to jointly handle phase unwrapping and tensor tomography using a differentiable forward model based on Jones calculus. Our non-linear model faithfully matches real captures, unlike prior linear approximations. We develop an experimental multi-axis polariscope setup to capture 3D photoelasticity and experimentally demonstrate that NeST reconstructs the internal stress distribution for objects with varying shape and force conditions. Additionally, we showcase novel applications in stress analysis, such as visualizing photoelastic fringes by virtually slicing the object and viewing photoelastic fringes from unseen viewpoints. NeST paves the way for scalable non-destructive 3D photoelastic analysis.
Submitted 24 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Streaming quanta sensors for online, high-performance imaging and vision
Authors:
Tianyi Zhang,
Matthew Dutson,
Vivek Boominathan,
Mohit Gupta,
Ashok Veeraraghavan
Abstract:
Recently, quanta image sensors (QIS) -- ultra-fast, zero-read-noise binary image sensors -- have demonstrated remarkable imaging capabilities in many challenging scenarios. Despite their potential, the adoption of these sensors is severely hampered by (a) high data rates and (b) the need for new computational pipelines to handle the unconventional raw data. We introduce a simple, low-bandwidth computational pipeline to address these challenges. Our approach is based on a novel streaming representation with a small memory footprint, efficiently capturing intensity information at multiple temporal scales. Updating the representation requires only 16 floating-point operations/pixel, which can be efficiently computed online at the native frame rate of the binary frames. We use a neural network operating on this representation to reconstruct videos in real-time (10-30 fps). We illustrate why such a representation is well-suited for these emerging sensors, and how it offers low latency and high frame rate while retaining flexibility for downstream computer vision. Our approach yields a ~100X reduction in data bandwidth and enables real-time image reconstruction and computer vision, with a $10^4$-$10^5$ reduction in computation relative to the existing state-of-the-art approach while maintaining comparable quality. To the best of our knowledge, our approach is the first to achieve online, real-time image reconstruction on QIS.
Submitted 2 June, 2024;
originally announced June 2024.
-
Metric-Guided Conformal Bounds for Probabilistic Image Reconstruction
Authors:
Matt Y Cheung,
Tucker J Netherton,
Laurence E Court,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
Modern deep learning reconstruction algorithms generate impressively realistic scans from sparse inputs, but can often produce significant inaccuracies. This makes it difficult to provide statistically guaranteed claims about the true state of a subject from scans reconstructed by these algorithms. In this study, we propose a framework for computing provably valid prediction bounds on claims derived from probabilistic black-box image reconstruction algorithms. The key insights behind our framework are to represent reconstructed scans with a derived clinical metric of interest, and to calibrate bounds on the ground truth metric with conformal prediction (CP) using a prior calibration dataset. These bounds convey interpretable feedback about the subject's state, and can also be used to retrieve nearest-neighbor reconstructed scans for visual inspection. We demonstrate the utility of this framework on sparse-view computed tomography (CT) for fat mass quantification and radiotherapy planning tasks. Results show that our framework produces bounds with better semantical interpretation than conventional pixel-based bounding approaches. Furthermore, we can flag dangerous outlier reconstructions that look plausible but have statistically unlikely metric values.
Submitted 26 September, 2025; v1 submitted 23 April, 2024;
originally announced April 2024.
-
WaveMo: Learning Wavefront Modulations to See Through Scattering
Authors:
Mingyang Xie,
Haiyun Guo,
Brandon Y. Feng,
Lingbo Jin,
Ashok Veeraraghavan,
Christopher A. Metzler
Abstract:
Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/.
Submitted 11 April, 2024;
originally announced April 2024.
-
DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images
Authors:
Zaid Tasneem,
Akshat Dave,
Abhishek Singh,
Kushagra Tiwary,
Praneeth Vepakomma,
Ashok Veeraraghavan,
Ramesh Raskar
Abstract:
Neural radiance fields (NeRFs) show potential for transforming images captured worldwide into immersive 3D visual experiences. However, most of this captured visual data remains siloed in our camera rolls as these images contain personal details. Even if made public, the problem of learning 3D representations of billions of scenes captured daily in a centralized manner is computationally intractable. Our approach, DecentNeRF, is the first attempt at decentralized, crowd-sourced NeRFs that require $\sim 10^4\times$ less server computing for a scene than a centralized approach. Instead of sending the raw data, our approach requires users to send a 3D representation, distributing the high computation cost of training centralized NeRFs between the users. It learns photorealistic scene representations by decomposing users' 3D views into personal and global NeRFs and a novel optimally weighted aggregation of only the latter. We validate the advantage of our approach to learn NeRFs with photorealism and minimal server computation cost on structured synthetic and real-world photo tourism datasets. We further analyze how secure aggregation of global NeRFs in DecentNeRF minimizes the undesired reconstruction of personal content by the server.
Submitted 28 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging
Authors:
Bhargav Ghanekar,
Salman Siddique Khan,
Pranav Sharma,
Shreyas Singh,
Vivek Boominathan,
Kaushik Mitra,
Ashok Veeraraghavan
Abstract:
Passive, compact, single-shot 3D sensing is useful in many application areas such as microscopy, medical imaging, surgical navigation, and autonomous driving where form factor, time, and power constraints can exist. Obtaining RGB-D scene information over a short imaging distance, in an ultra-compact form factor, and in a passive, snapshot manner is challenging. Dual-pixel (DP) sensors are a potential solution. DP sensors collect light rays from two different halves of the lens in two interleaved pixel arrays, thus capturing two slightly different views of the scene, like a stereo camera system. However, imaging with a DP sensor implies that the defocus blur size is directly proportional to the disparity seen between the views. This creates a trade-off between disparity estimation and deblurring accuracy. To improve this trade-off, we propose CADS (Coded Aperture Dual-Pixel Sensing), in which we use a coded aperture in the imaging lens along with a DP sensor. In our approach, we jointly learn an optimal coded pattern and the reconstruction algorithm in an end-to-end optimization setting. Our resulting CADS imaging system demonstrates improvement of >1.5 dB PSNR in all-in-focus (AIF) estimates and 5-6% in depth estimation quality over naive DP sensing for a wide range of aperture settings. Furthermore, we build the proposed CADS prototypes for DSLR photography settings and in an endoscope and a dermoscope form factor. Our novel coded dual-pixel sensing approach demonstrates accurate RGB-D reconstruction results in simulations and real-world experiments in a passive, snapshot, and compact manner.
Submitted 30 March, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
ConVRT: Consistent Video Restoration Through Turbulence with Test-time Optimization of Neural Video Representations
Authors:
Haoming Cai,
Jingxi Chen,
Brandon Y. Feng,
Weiyun Jiang,
Mingyang Xie,
Kevin Zhang,
Ashok Veeraraghavan,
Christopher Metzler
Abstract:
Atmospheric turbulence presents a significant challenge in long-range imaging. Current restoration algorithms often struggle with temporal inconsistency, as well as limited generalization ability across varying turbulence levels and scene content different from the training data. To tackle these issues, we introduce a self-supervised method, Consistent Video Restoration through Turbulence (ConVRT), a test-time optimization method featuring a neural video representation designed to enhance temporal consistency in restoration. A key innovation of ConVRT is the integration of a pretrained vision-language model (CLIP) for semantic-oriented supervision, which steers the restoration towards sharp, photorealistic images in the CLIP latent space. We further develop a principled selection strategy of text prompts, based on their statistical correlation with a perceptual metric. ConVRT's test-time optimization allows it to adapt to a wide range of real-world turbulence conditions, effectively leveraging the insights gained from pre-trained models on simulated data. ConVRT offers a comprehensive and effective solution for mitigating real-world turbulence in dynamic videos.
Submitted 7 December, 2023;
originally announced December 2023.
-
Event-based Motion-Robust Accurate Shape Estimation for Mixed Reflectance Scenes
Authors:
Aniket Dashpute,
Jiazhang Wang,
James Taylor,
Oliver Cossairt,
Ashok Veeraraghavan,
Florian Willomitzer
Abstract:
Event-based structured light systems have recently been introduced as an exciting alternative to conventional frame-based triangulation systems for the 3D measurements of diffuse surfaces. Important benefits include the fast capture speed and the high dynamic range provided by the event camera - albeit at the cost of lower data quality. So far, both low-accuracy event-based and high-accuracy frame-based 3D imaging systems are tailored to a specific surface type, such as diffuse or specular, and cannot be used for a broader class of object surfaces ("mixed reflectance scenes"). In this work, we present a novel event-based structured light system that enables fast 3D imaging of mixed reflectance scenes with high accuracy. On the captured events, we use epipolar constraints that intrinsically enable decomposing the measured reflections into diffuse, two-bounce specular, and other multi-bounce reflections. The diffuse surfaces in the scene are reconstructed using triangulation. Then, the reconstructed diffuse scene parts are leveraged as a "display" to evaluate the specular scene parts via deflectometry. This novel procedure allows us to use the entire scene as a virtual screen, using only a scanning laser and an event camera. The resulting system achieves fast and motion-robust (14Hz) reconstructions of mixed reflectance scenes with < 600 $μ$m depth error. Moreover, we introduce an "ultrafast" capture mode (250Hz) for the 3D measurement of diffuse scenes.
Submitted 10 June, 2025; v1 submitted 16 November, 2023;
originally announced November 2023.
-
ISLAND: Interpolating Land Surface Temperature using land cover
Authors:
Yuhao Liu,
Pranavesh Panakkal,
Sylvia Dee,
Guha Balakrishnan,
Jamie Padgett,
Ashok Veeraraghavan
Abstract:
Cloud occlusion is a common problem in the field of remote sensing, particularly for retrieving Land Surface Temperature (LST). Remote sensing thermal instruments onboard operational satellites are supposed to enable frequent and high-resolution observations over land; unfortunately, clouds adversely affect thermal signals by blocking outgoing longwave radiation emission from the Earth's surface, interfering with the retrieved ground emission temperature. Such cloud contamination severely reduces the set of serviceable LST images for downstream applications, making it impractical to perform intricate time-series analysis of LST. In this paper, we introduce a novel method to remove cloud occlusions from Landsat 8 LST images. We call our method ISLAND, an acronym for Interpolating Land Surface Temperature using land cover. Our approach uses LST images from Landsat 8 (at 30 m resolution with 16-day revisit cycles) and the NLCD land cover dataset. Inspired by Tobler's first law of Geography, ISLAND predicts occluded LST through a set of spatio-temporal filters that perform distance-weighted spatio-temporal interpolation. A critical feature of ISLAND is that the filters are land cover-class aware, making it particularly advantageous in complex urban settings with heterogeneous land cover types and distributions. Through qualitative and quantitative analysis, we show that ISLAND achieves robust reconstruction performance across a variety of cloud occlusion and surface land cover conditions, and with a high spatio-temporal resolution. We provide a public dataset of 20 U.S. cities with pre-computed ISLAND LST outputs. Using several case studies, we demonstrate that ISLAND opens the door to a multitude of high-impact urban and environmental applications across the continental United States.
Submitted 29 August, 2024; v1 submitted 21 September, 2023;
originally announced September 2023.
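The land-cover-aware, distance-weighted interpolation described in the ISLAND abstract can be illustrated with a much-simplified sketch. It is not the authors' implementation: it handles a single image (spatial only, no temporal filtering), and the inverse-distance weighting `1 / d**power`, the function name `fill_occluded`, and all parameter names are assumptions made for illustration.

```python
import numpy as np

def fill_occluded(lst, cloud_mask, land_cover, power=2.0):
    """Fill cloud-occluded pixels with an inverse-distance-weighted average
    of clear pixels that share the same land cover class.

    Hypothetical helper: spatial-only, single image, brute-force loops.
    """
    lst = np.asarray(lst, dtype=float)
    cloud_mask = np.asarray(cloud_mask, dtype=bool)
    land_cover = np.asarray(land_cover)
    filled = lst.copy()
    rows, cols = np.indices(lst.shape)

    for r, c in zip(*np.nonzero(cloud_mask)):
        # Donors: clear pixels of the same land cover class as the target.
        donors = (~cloud_mask) & (land_cover == land_cover[r, c])
        if not donors.any():
            filled[r, c] = np.nan  # nothing to interpolate from
            continue
        # Distances are strictly positive: donors are clear, the target is not.
        dist = np.hypot(rows[donors] - r, cols[donors] - c)
        weights = 1.0 / dist ** power
        filled[r, c] = np.sum(weights * lst[donors]) / np.sum(weights)
    return filled
```

Restricting donors to the same class is what keeps, for example, a water pixel from being filled with nearby pavement temperatures; a practical version would also need temporal neighbours and a bounded search window.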
-
CT Reconstruction from Few Planar X-rays with Application towards Low-resource Radiotherapy
Authors:
Yiran Sun,
Tucker Netherton,
Laurence Court,
Ashok Veeraraghavan,
Guha Balakrishnan
Abstract:
CT scans are the standard-of-care for many clinical ailments, and are needed for treatments like external beam radiotherapy. Unfortunately, CT scanners are rare in low- and mid-resource settings due to their costs. Planar X-ray radiography units, in comparison, are far more prevalent, but can only provide limited 2D observations of the 3D anatomy. In this work, we propose a method to generate CT volumes from few (<5) planar X-ray observations using a prior data distribution, and perform the first evaluation of such a reconstruction algorithm for a clinical application: radiotherapy planning. We propose a deep generative model, building on advances in neural implicit representations, to synthesize volumetric CT scans from few input planar X-ray images at different angles. To focus the generation task on clinically-relevant features, our model can also leverage anatomical guidance during training (via segmentation masks). We generated 2-field opposed, palliative radiotherapy plans on thoracic CTs reconstructed by our method, and found that isocenter radiation dose on reconstructed scans has <1% error with respect to the dose calculated on clinically acquired CTs using <=4 X-ray views. In addition, our method is better than recent sparse CT reconstruction baselines in terms of standard pixel and structure-level metrics (PSNR, SSIM, Dice score) on the public LIDC lung CT dataset. Code is available at: https://github.com/wanderinrain/Xray2CT.
Submitted 3 August, 2023;
originally announced August 2023.
-
NeRT: Implicit Neural Representations for General Unsupervised Turbulence Mitigation
Authors:
Weiyun Jiang,
Yuhao Liu,
Vivek Boominathan,
Ashok Veeraraghavan
Abstract:
The atmospheric and water turbulence mitigation problems have emerged as challenging inverse problems in computer vision and optics communities over the years. However, current methods either rely heavily on the quality of the training dataset or fail to generalize over various scenarios, such as static scenes, dynamic scenes, and text reconstructions. We propose a general implicit neural representation for unsupervised atmospheric and water turbulence mitigation (NeRT). NeRT leverages the implicit neural representations and the physically correct tilt-then-blur turbulence model to reconstruct the clean, undistorted image, given only dozens of distorted input images. Moreover, we show that NeRT outperforms the state-of-the-art through various qualitative and quantitative evaluations of atmospheric and water turbulence datasets. Furthermore, we demonstrate the ability of NeRT to eliminate uncontrolled turbulence from real-world environments. Lastly, we incorporate NeRT into continuously captured video sequences and demonstrate $48 \times$ speedup.
Submitted 1 April, 2024; v1 submitted 1 August, 2023;
originally announced August 2023.
-
Broadband Thermal Imaging using Meta-Optics
Authors:
Luocheng Huang,
Zheyi Han,
Anna Wirth-Singh,
Vishwanath Saragadam,
Saswata Mukherjee,
Johannes E. Fröch,
Quentin A. A. Tanguy,
Joshua Rollag,
Ricky Gibson,
Joshua R. Hendrickson,
Phillip W. C. Hon,
Orrin Kigner,
Zachary Coppens,
Karl F. Böhringer,
Ashok Veeraraghavan,
Arka Majumdar
Abstract:
Subwavelength diffractive optics known as meta-optics have demonstrated the potential to significantly miniaturize imaging systems. However, despite impressive demonstrations, most meta-optical imaging systems suffer from strong chromatic aberrations, limiting their utilities. Here, we employ inverse-design to create broadband meta-optics operating in the long-wave infrared (LWIR) regime (8 - 12 $μ$m). Via a deep-learning assisted multi-scale differentiable framework that links meta-atoms to the phase, we maximize the wavelength-averaged volume under the modulation transfer function (MTF) of the meta-optics. Our design framework merges local phase-engineering via meta-atoms and global engineering of the scatterer within a single pipeline. We corroborate our design by fabricating and experimentally characterizing all-silicon LWIR meta-optics. Our engineered meta-optic is complemented by a simple computational backend that dramatically improves the quality of the captured image. We experimentally demonstrate a six-fold improvement of the wavelength-averaged Strehl ratio over the traditional hyperboloid metalens for broadband imaging.
Submitted 5 September, 2023; v1 submitted 21 July, 2023;
originally announced July 2023.
-
Role of Transients in Two-Bounce Non-Line-of-Sight Imaging
Authors:
Siddharth Somasundaram,
Akshat Dave,
Connor Henley,
Ashok Veeraraghavan,
Ramesh Raskar
Abstract:
The goal of non-line-of-sight (NLOS) imaging is to image objects occluded from the camera's field of view using multiply scattered light. Recent works have demonstrated the feasibility of two-bounce (2B) NLOS imaging by scanning a laser and measuring cast shadows of occluded objects in scenes with two relay surfaces. In this work, we study the role of time-of-flight (ToF) measurements, i.e., transients, in 2B-NLOS under multiplexed illumination. Specifically, we study how ToF information can reduce the number of measurements and spatial resolution needed for shape reconstruction. We present our findings with respect to tradeoffs in (1) temporal resolution, (2) spatial resolution, and (3) number of image captures by studying SNR and recoverability as functions of system parameters. This leads to a formal definition of the mathematical constraints for 2B lidar. We believe that our work lays an analytical groundwork for design of future NLOS imaging systems, especially as ToF sensors become increasingly ubiquitous.
Submitted 3 April, 2023;
originally announced April 2023.
-
Thermal Spread Functions (TSF): Physics-guided Material Classification
Authors:
Aniket Dashpute,
Vishwanath Saragadam,
Emma Alexander,
Florian Willomitzer,
Aggelos Katsaggelos,
Ashok Veeraraghavan,
Oliver Cossairt
Abstract:
Robust and non-destructive material classification is a challenging but crucial first-step in numerous vision applications. We propose a physics-guided material classification framework that relies on thermal properties of the object. Our key observation is that the rate of heating and cooling of an object depends on the unique intrinsic properties of the material, namely the emissivity and diffusivity. We leverage this observation by gently heating the objects in the scene with a low-power laser for a fixed duration and then turning it off, while a thermal camera captures measurements during the heating and cooling process. We then take this spatial and temporal "thermal spread function" (TSF) to solve an inverse heat equation using the finite-differences approach, resulting in a spatially varying estimate of diffusivity and emissivity. These tuples are then used to train a classifier that produces a fine-grained material label at each spatial pixel. Our approach is extremely simple requiring only a small light source (low power laser) and a thermal camera, and produces robust classification results with 86% accuracy over 16 classes.
Submitted 2 April, 2023;
originally announced April 2023.
-
WIRE: Wavelet Implicit Neural Representations
Authors:
Vishwanath Saragadam,
Daniel LeJeune,
Jasper Tan,
Guha Balakrishnan,
Ashok Veeraraghavan,
Richard G. Baraniuk
Abstract:
Implicit neural representations (INRs) have recently advanced numerous vision-related areas. INR performance depends strongly on the choice of the nonlinear activation function employed in its multilayer perceptron (MLP) network. A wide range of nonlinearities have been explored, but, unfortunately, current INRs designed to have high accuracy also suffer from poor robustness (to signal noise, parameter variation, etc.). Inspired by harmonic analysis, we develop a new, highly accurate and robust INR that does not exhibit this tradeoff. Wavelet Implicit neural REpresentation (WIRE) uses a continuous complex Gabor wavelet activation function that is well-known to be optimally concentrated in space-frequency and to have excellent biases for representing images. A wide range of experiments (image denoising, image inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields) demonstrate that WIRE defines the new state of the art in INR accuracy, training time, and robustness.
Submitted 5 January, 2023;
originally announced January 2023.
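The continuous complex Gabor wavelet activation named in the WIRE abstract can be sketched in a few lines. This is a minimal illustration, assuming the commonly cited form $ψ(x) = e^{jω_0 x} e^{-|s_0 x|^2}$; the default values of `omega0` and `s0`, and the helper names `gabor_activation` and `wire_layer`, are illustrative assumptions rather than details given in the abstract.

```python
import numpy as np

def gabor_activation(x, omega0=10.0, s0=10.0):
    """Complex Gabor wavelet: a complex sinusoid under a Gaussian window.

    omega0 sets the oscillation frequency and s0 the window width; both
    defaults are placeholders chosen for illustration.
    """
    x = np.asarray(x, dtype=np.complex128)
    return np.exp(1j * omega0 * x) * np.exp(-np.abs(s0 * x) ** 2)

def wire_layer(x, weight, bias, omega0=10.0, s0=10.0):
    """One hypothetical MLP layer: affine map followed by the Gabor activation.

    x: (N, d_in), weight: (d_in, d_out), bias: (d_out,). Output is complex.
    """
    return gabor_activation(x @ weight + bias, omega0, s0)
```

The Gaussian factor localizes the response in space while the sinusoid localizes it in frequency, which is the space-frequency concentration the abstract refers to.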
-
Foveated Thermal Computational Imaging in the Wild Using All-Silicon Meta-Optics
Authors:
Vishwanath Saragadam,
Zheyi Han,
Vivek Boominathan,
Luocheng Huang,
Shiyu Tan,
Johannes E. Fröch,
Karl F. Böhringer,
Richard G. Baraniuk,
Arka Majumdar,
Ashok Veeraraghavan
Abstract:
Foveated imaging provides a better tradeoff between situational awareness (field of view) and resolution and is critical in long-wavelength infrared regimes because of the size, weight, power, and cost of thermal sensors. We demonstrate computational foveated imaging by exploiting the ability of a meta-optical frontend to discriminate between different polarization states and a computational backend to reconstruct the captured image/video. The frontend is a three-element optic: the first element, which we call the "foveal" element, is a metalens that focuses s-polarized light at a distance of $f_1$ without affecting the p-polarized light; the second element, which we call the "perifoveal" element, is another metalens that focuses p-polarized light at a distance of $f_2$ without affecting the s-polarized light. The third element is a freely rotating polarizer that dynamically changes the mixing ratios between the two polarization states. Both the foveal element (focal length = 150mm; diameter = 75mm) and the perifoveal element (focal length = 25mm; diameter = 25mm) were fabricated as polarization-sensitive, all-silicon metasurfaces, resulting in a large-aperture, 1:6 foveal expansion, thermal imaging capability. A computational backend then utilizes a deep image prior to separate the resultant multiplexed image or video into a foveated image consisting of a high-resolution center and a lower-resolution large field of view context. We build a first-of-its-kind prototype system and demonstrate 12 frames per second real-time thermal foveated image and video capture in the wild.
Submitted 12 December, 2022;
originally announced December 2022.
-
ORCa: Glossy Objects as Radiance Field Cameras
Authors:
Kushagra Tiwary,
Akshat Dave,
Nikhil Behari,
Tzofi Klinghoffer,
Ashok Veeraraghavan,
Ramesh Raskar
Abstract:
Reflections on glossy objects contain valuable and hidden information about the surrounding environment. By converting these objects into cameras, we can unlock exciting applications, including imaging beyond the camera's field-of-view and from seemingly impossible vantage points, e.g. from reflections on the human eye. However, this task is challenging because reflections depend jointly on object geometry, material properties, the 3D environment, and the observer viewing direction. Our approach converts glossy objects with unknown geometry into radiance-field cameras to image the world from the object's perspective. Our key insight is to convert the object surface into a virtual sensor that captures cast reflections as a 2D projection of the 5D environment radiance field visible to the object. We show that recovering the environment radiance fields enables depth and radiance estimation from the object to its surroundings in addition to beyond field-of-view novel-view synthesis, i.e. rendering of novel views that are only directly-visible to the glossy object present in the scene, but not the observer. Moreover, using the radiance field we can image around occluders caused by close-by objects in the scene. Our method is trained end-to-end on multi-view images of the object and jointly estimates object geometry, diffuse radiance, and the 5D environment radiance field.
Submitted 12 December, 2022; v1 submitted 8 December, 2022;
originally announced December 2022.
-
PS$^2$F: Polarized Spiral Point Spread Function for Single-Shot 3D Sensing
Authors:
Bhargav Ghanekar,
Vishwanath Saragadam,
Dushyant Mehra,
Anna-Karin Gustavsson,
Aswin Sankaranarayanan,
Ashok Veeraraghavan
Abstract:
We propose a compact snapshot monocular depth estimation technique that relies on an engineered point spread function (PSF). Traditional approaches used in microscopic super-resolution imaging such as the Double-Helix PSF (DHPSF) are ill-suited for scenes that are more complex than a sparse set of point light sources. We show, using the Cramér-Rao lower bound, that separating the two lobes of the DHPSF and thereby capturing two separate images leads to a dramatic increase in depth accuracy. A special property of the phase mask used for generating the DHPSF is that a separation of the phase mask into two halves leads to a spatial separation of the two lobes. We leverage this property to build a compact polarization-based optical setup, where we place two orthogonal linear polarizers on each half of the DHPSF phase mask and then capture the resulting image with a polarization-sensitive camera. Results from simulations and a lab prototype demonstrate that our technique achieves up to $50\%$ lower depth error compared to state-of-the-art designs including the DHPSF and the Tetrapod PSF, with little to no loss in spatial resolution.
Submitted 4 August, 2022; v1 submitted 2 July, 2022;
originally announced July 2022.
-
i-FlatCam: A 253 FPS, 91.49 $μ$J/Frame Ultra-Compact Intelligent Lensless Camera for Real-Time and Efficient Eye Tracking in VR/AR
Authors:
Yang Zhao,
Ziyun Li,
Yonggan Fu,
Yongan Zhang,
Chaojian Li,
Cheng Wan,
Haoran You,
Shang Wu,
Xu Ouyang,
Vivek Boominathan,
Ashok Veeraraghavan,
Yingyan Celine Lin
Abstract:
We present a first-of-its-kind ultra-compact intelligent camera system, dubbed i-FlatCam, including a lensless camera with a computational (Comp.) chip. It highlights (1) a predict-then-focus eye tracking pipeline for boosted efficiency without compromising the accuracy, (2) a unified compression scheme for single-chip processing and improved frame rate per second (FPS), and (3) dedicated intra-channel reuse design for depth-wise convolutional layers (DW-CONV) to increase utilization. i-FlatCam demonstrates the first eye tracking pipeline with a lensless camera and achieves 3.16 degrees of accuracy, 253 FPS, 91.49 $μ$J/Frame, and 6.7mm x 8.9mm x 1.2mm camera form factor, paving the way for next-generation Augmented Reality (AR) and Virtual Reality (VR) devices.
Submitted 28 March, 2025; v1 submitted 15 June, 2022;
originally announced June 2022.
-
Distributed Generalized Wirtinger Flow for Interferometric Imaging on Networks
Authors:
Sean M. Farrell,
Ashok Veeraraghavan,
Ashutosh Sabharwal,
César A. Uribe
Abstract:
We study the problem of decentralized interferometric imaging over networks, where agents have access to a subset of local radar measurements and can compute pair-wise correlations with their neighbors. We propose a primal-dual distributed algorithm named Distributed Generalized Wirtinger Flow (DGWF). We use the theory of low rank matrix recovery to show when the interferometric imaging problem satisfies the Regularity Condition, which implies the Polyak-Lojasiewicz inequality. Moreover, we show that DGWF converges geometrically for smooth functions. Numerical simulations for single-scattering radar interferometric imaging demonstrate that DGWF can achieve the same mean-squared error image reconstruction quality as its centralized counterpart for various network connectivity and size.
Submitted 8 June, 2022;
originally announced June 2022.
-
EyeCoD: Eye Tracking System Acceleration via FlatCam-based Algorithm & Accelerator Co-Design
Authors:
Haoran You,
Cheng Wan,
Yang Zhao,
Zhongzhi Yu,
Yonggan Fu,
Jiayi Yuan,
Shang Wu,
Shunyao Zhang,
Yongan Zhang,
Chaojian Li,
Vivek Boominathan,
Ashok Veeraraghavan,
Ziyun Li,
Yingyan Celine Lin
Abstract:
Eye tracking has become an essential human-machine interaction modality for providing immersive experience in numerous virtual and augmented reality (VR/AR) applications desiring high throughput (e.g., 240 FPS), small-form, and enhanced visual privacy. However, existing eye tracking systems are still limited by their: (1) large form-factor largely due to the adopted bulky lens-based cameras; and (2) high communication cost required between the camera and backend processor, thus prohibiting their more extensive applications. To this end, we propose a lensless FlatCam-based eye tracking algorithm and accelerator co-design framework dubbed EyeCoD to enable eye tracking systems with a much reduced form-factor and boosted system efficiency without sacrificing the tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams to facilitate the small form-factor need in mobile eye tracking systems. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then only focuses on the ROI parts to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, and (3) utilizes input feature-wise partition to save activation memory size. On-silicon measurement validates that our EyeCoD consistently reduces both the communication and computation costs, leading to an overall system speedup of 10.95x, 3.21x, and 12.85x over CPUs, GPUs, and a prior-art eye tracking processor called CIS-GEP, respectively, while maintaining the tracking accuracy.
Submitted 2 March, 2025; v1 submitted 2 June, 2022;
originally announced June 2022.
-
DeepTensor: Low-Rank Tensor Decomposition with Deep Network Priors
Authors:
Vishwanath Saragadam,
Randall Balestriero,
Ashok Veeraraghavan,
Richard G. Baraniuk
Abstract:
DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a self-supervised manner to minimize the mean-squared approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of the reach of classical linear methods like the singular value decomposition (SVD) and principal component analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor's entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions by exploring a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6dB signal-to-noise ratio improvement over standard denoising methods for signals corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.
Submitted 6 April, 2022;
originally announced April 2022.
-
PANDORA: Polarization-Aided Neural Decomposition Of Radiance
Authors:
Akshat Dave,
Yongyi Zhao,
Ashok Veeraraghavan
Abstract:
Reconstructing an object's geometry and appearance from multiple images, also known as inverse rendering, is a fundamental problem in computer graphics and vision. Inverse rendering is inherently ill-posed because the captured image is an intricate function of unknown lighting conditions, material properties and scene geometry. Recent progress in representing scene properties as coordinate-based neural networks has facilitated neural inverse rendering, resulting in impressive geometry reconstruction and novel-view synthesis. Our key insight is that polarization is a useful cue for neural inverse rendering as polarization strongly depends on surface normals and is distinct for diffuse and specular reflectance. With the advent of commodity, on-chip, polarization sensors, capturing polarization has become practical. Thus, we propose PANDORA, a polarimetric inverse rendering approach based on implicit neural representations. From multi-view polarization images of an object, PANDORA jointly extracts the object's 3D geometry, separates the outgoing radiance into diffuse and specular components, and estimates the illumination incident on the object. We show that PANDORA outperforms state-of-the-art radiance decomposition techniques. PANDORA outputs clean surface reconstructions free from texture artefacts, models strong specularities accurately and estimates illumination under practical unstructured scenarios.
Submitted 25 March, 2022;
originally announced March 2022.
-
MINER: Multiscale Implicit Neural Representations
Authors:
Vishwanath Saragadam,
Jasper Tan,
Guha Balakrishnan,
Richard G. Baraniuk,
Ashok Veeraraghavan
Abstract:
We introduce a new neural signal model designed for efficient high-resolution representation of large-scale signals. The key innovation in our multiscale implicit neural representation (MINER) is an internal representation via a Laplacian pyramid, which provides a sparse multiscale decomposition of the signal that captures orthogonal parts of the signal across scales. We leverage the advantages of the Laplacian pyramid by representing small disjoint patches of the pyramid at each scale with a small MLP. This enables the capacity of the network to adaptively increase from coarse to fine scales, and only represent parts of the signal with strong signal energy. The parameters of each MLP are optimized from coarse to fine scales, which results in faster approximations at coarser scales and ultimately in an extremely fast training process. We apply MINER to a range of large-scale signal representation tasks, including gigapixel images and very large point clouds, and demonstrate that it requires fewer than 25% of the parameters, 33% of the memory footprint, and 10% of the computation time of competing techniques such as ACORN to reach the same representation accuracy.
Submitted 17 July, 2022; v1 submitted 7 February, 2022;
originally announced February 2022.
-
Deep-3D Microscope: 3D volumetric microscopy of thick scattering samples using a wide-field microscope and machine learning
Authors:
Bowen Li,
Shiyu Tan,
Jiuyang Dong,
Xiaocong Lian,
Yongbing Zhang,
Xiangyang Ji,
Ashok Veeraraghavan
Abstract:
Confocal microscopy is the standard approach for obtaining volumetric images of a sample with high axial and lateral resolution, especially when dealing with scattering samples. Unfortunately, a confocal microscope is quite expensive compared to traditional microscopes. In addition, the point scanning in a confocal microscope leads to slow imaging speed and photobleaching due to the high dose of laser energy. In this paper, we demonstrate how advances in machine learning can be exploited to "teach" a traditional wide-field microscope, one that's available in every lab, to produce 3D volumetric images like a confocal microscope. The key idea is to obtain multiple images with different focus settings using a wide-field microscope and use a 3D Generative Adversarial Network (GAN) based neural network to learn the mapping between the blurry low-contrast image stack obtained using wide-field imaging and the sharp, high-contrast images obtained using a confocal microscope. After training with widefield-confocal image pairs, the network can reliably and accurately reconstruct 3D volumetric images that rival confocal images in terms of lateral resolution, z-sectioning and image contrast. Our experimental results demonstrate the ability to generalize to unseen data, stability in the reconstruction results, and high spatial resolution even when imaging thick ($\sim40$ microns) highly-scattering samples. We believe that such learning-based microscopes have the potential to bring confocal-quality imaging to every lab that has a wide-field microscope.
Submitted 14 October, 2021;
originally announced October 2021.
-
Thermal Image Processing via Physics-Inspired Deep Networks
Authors:
Vishwanath Saragadam,
Akshat Dave,
Ashok Veeraraghavan,
Richard Baraniuk
Abstract:
We introduce DeepIR, a new thermal image processing framework that combines physically accurate sensor modeling with deep network-based image representation. Our key enabling observations are that the images captured by thermal sensors can be factored into slowly changing, scene-independent sensor non-uniformities (that can be accurately modeled using physics) and a scene-specific radiance flux (that is well-represented using a deep network-based regularizer). DeepIR requires neither training data nor periodic ground-truth calibration with a known black body target--making it well suited for practical computer vision tasks. We demonstrate the power of going DeepIR by developing new denoising and super-resolution algorithms that exploit multiple images of the scene captured with camera jitter. Simulated and real data experiments demonstrate that DeepIR can perform high-quality non-uniformity correction with as few as three images, achieving a 10dB PSNR improvement over competing approaches.
Submitted 25 August, 2021; v1 submitted 18 August, 2021;
originally announced August 2021.
-
CodedStereo: Learned Phase Masks for Large Depth-of-field Stereo
Authors:
Shiyu Tan,
Yicheng Wu,
Shoou-I Yu,
Ashok Veeraraghavan
Abstract:
Conventional stereo suffers from a fundamental trade-off between imaging volume and signal-to-noise ratio (SNR) -- due to the conflicting impact of aperture size on both these variables. Inspired by extended depth of field cameras, we propose a novel end-to-end learning-based technique to overcome this limitation, by introducing a phase mask at the aperture plane of the cameras in a stereo imaging system. The phase mask creates a depth-dependent point spread function, allowing us to recover sharp image texture and stereo correspondence over a significantly larger extended depth of field (EDOF) than conventional stereo. The phase mask pattern, the EDOF image reconstruction, and the stereo disparity estimation are all trained together using an end-to-end learned deep neural network. We perform theoretical analysis and characterization of the proposed approach and show, in simulation, a 6x increase in the volume that can be imaged. We also build an experimental prototype and validate the approach using real-world results acquired with this prototype system.
Submitted 9 April, 2021;
originally announced April 2021.
-
High Resolution, Deep Imaging Using Confocal Time-of-flight Diffuse Optical Tomography
Authors:
Yongyi Zhao,
Ankit Raghuram,
Hyun K. Kim,
Andreas H. Hielscher,
Jacob T. Robinson,
Ashok Veeraraghavan
Abstract:
Light scattering by tissue severely limits how deep beneath the surface one can image, and the spatial resolution one can obtain from these images. Diffuse optical tomography (DOT) is one of the most powerful techniques for imaging deep within tissue -- well beyond the conventional $\sim$10-15 mean scattering lengths tolerated by ballistic imaging techniques such as confocal and two-photon microscopy. Unfortunately, existing DOT systems are limited, achieving only centimeter-scale resolution. Furthermore, they suffer from slow acquisition times and slow reconstruction speeds, making real-time imaging infeasible. We show that time-of-flight diffuse optical tomography (ToF-DOT) and its confocal variant (CToF-DOT), by exploiting the photon travel time information, allow us to achieve millimeter spatial resolution in the highly scattered diffusion regime ($> 50$ mean free paths). In addition, we demonstrate two further innovations, focusing on confocal measurements and multiplexing the illumination sources, which allow us to significantly reduce the measurement acquisition time. Finally, we rely on a novel convolutional approximation that allows us to develop a fast reconstruction algorithm, achieving a 100$\times$ speedup in reconstruction time compared to traditional DOT reconstruction techniques. Together, we believe that these technical advances serve as the first step towards real-time, millimeter resolution, deep tissue imaging using DOT.
Submitted 27 May, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
SASSI -- Super-Pixelated Adaptive Spatio-Spectral Imaging
Authors:
Vishwanath Saragadam,
Michael DeZeeuw,
Richard Baraniuk,
Ashok Veeraraghavan,
Aswin Sankaranarayanan
Abstract:
We introduce a novel video-rate hyperspectral imager with high spatial and temporal resolution. Our key hypothesis is that spectral profiles of pixels in a super-pixel of an oversegmented image tend to be very similar. Hence, a scene-adaptive spatial sampling of a hyperspectral scene, guided by its super-pixel segmented image, is capable of obtaining high-quality reconstructions. To achieve this, we acquire an RGB image of the scene and compute its super-pixels, from which we generate a spatial mask of locations where we measure a high-resolution spectrum. The hyperspectral image is subsequently estimated by fusing the RGB image and the spectral measurements using a learnable guided filtering approach. Due to the low computational complexity of the superpixel estimation step, our setup can capture hyperspectral images of scenes with little overhead over traditional snapshot hyperspectral cameras, but with significantly higher spatial and spectral resolutions. We validate the proposed technique with extensive simulations as well as a lab prototype that measures hyperspectral video at a spatial resolution of $600 \times 900$ pixels, at a spectral resolution of 10 nm over visible wavebands, and at a frame rate of $18$ fps.
Submitted 28 December, 2020;
originally announced December 2020.
-
How to Train Neural Networks for Flare Removal
Authors:
Yicheng Wu,
Qiurui He,
Tianfan Xue,
Rahul Garg,
Jiawen Chen,
Ashok Veeraraghavan,
Jonathan T. Barron
Abstract:
When a camera is pointed at a strong light source, the resulting photograph may contain lens flare artifacts. Flares appear in a wide variety of patterns (halos, streaks, color bleeding, haze, etc.) and this diversity in appearance makes flare removal challenging. Existing analytical solutions make strong assumptions about the artifact's geometry or brightness, and therefore only work well on a small subset of flares. Machine learning techniques have shown success in removing other types of artifacts, like reflections, but have not been widely applied to flare removal due to the lack of training data. To solve this problem, we explicitly model the optical causes of flare either empirically or using wave optics, and generate semi-synthetic pairs of flare-corrupted and clean images. This enables us to train neural networks to remove lens flare for the first time. Experiments show our data synthesis approach is critical for accurate flare removal, and that models trained with our technique generalize well to real lens flares across different scenes, lighting conditions, and cameras.
Submitted 7 October, 2021; v1 submitted 24 November, 2020;
originally announced November 2020.
-
FlatNet: Towards Photorealistic Scene Reconstruction from Lensless Measurements
Authors:
Salman S. Khan,
Varun Sundar,
Vivek Boominathan,
Ashok Veeraraghavan,
Kaushik Mitra
Abstract:
Lensless imaging has emerged as a potential solution towards realizing ultra-miniature cameras by eschewing the bulky lens in a traditional camera. Without a focusing lens, the lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, the current iterative-optimization-based reconstruction algorithms produce noisier and perceptually poorer images. In this work, we propose a non-iterative deep learning based reconstruction approach that results in orders of magnitude improvement in image quality for lensless reconstructions. Our approach, called $\textit{FlatNet}$, lays down a framework for reconstructing high-quality photorealistic images from mask-based lensless cameras, where the camera's forward model formulation is known. FlatNet consists of two stages: (1) an inversion stage that maps the measurement into a space of intermediate reconstruction by learning parameters within the forward model formulation, and (2) a perceptual enhancement stage that improves the perceptual quality of this intermediate reconstruction. These stages are trained together in an end-to-end manner. We show high-quality reconstructions by performing extensive experiments on real and challenging scenes using two different types of lensless prototypes: one which uses a separable forward model and another, which uses a more general non-separable cropped-convolution model. Our end-to-end approach is fast, produces photorealistic reconstructions, and is easy to adopt for other mask-based lensless cameras.
Submitted 29 October, 2020;
originally announced October 2020.
-
The Benefit of Distraction: Denoising Remote Vitals Measurements using Inverse Attention
Authors:
Ewa Nowara,
Daniel McDuff,
Ashok Veeraraghavan
Abstract:
Attention is a powerful concept in computer vision. End-to-end networks that learn to focus selectively on regions of an image or video often perform strongly. However, other image regions, while not necessarily containing the signal of interest, may contain useful context. We present an approach that exploits the idea that statistics of noise may be shared between the regions that contain the signal of interest and those that do not. Our technique uses the inverse of an attention mask to generate a noise estimate that is then used to denoise temporal observations. We apply this to the task of camera-based physiological measurement. A convolutional attention network is used to learn which regions of a video contain the physiological signal and generate a preliminary estimate. A noise estimate is obtained by using the pixel intensities in the inverse regions of the learned attention mask; this in turn is used to refine the estimate of the physiological signal. We perform experiments on two large benchmark datasets and show that this approach produces state-of-the-art results, increasing the signal-to-noise ratio by up to 5.8 dB, reducing heart rate and breathing rate estimation error by as much as 30%, recovering subtle pulse waveform dynamics, and generalizing from RGB to NIR videos without retraining.
Submitted 14 October, 2020;
originally announced October 2020.
-
Fine-grained Classification using Heterogeneous Web Data and Auxiliary Categories
Authors:
Li Niu,
Ashok Veeraraghavan,
Ashu Sabharwal
Abstract:
Fine-grained classification remains a very challenging problem, because of the absence of well-labeled training data caused by the high cost of annotating a large number of fine-grained categories. In the extreme case, given a set of test categories without any well-labeled training data, the majority of existing works can be grouped into the following two research directions: 1) crawl noisy labeled web data for the test categories as training data, which is dubbed webly supervised learning; 2) transfer the knowledge from auxiliary categories with well-labeled training data to the test categories, which corresponds to the zero-shot learning setting. Nevertheless, the above two research directions still have critical issues to be addressed. For the first direction, web data have noisy labels and a considerably different data distribution from test data. For the second direction, zero-shot learning struggles to achieve compelling results compared with conventional supervised learning. The issues of the above two directions motivate us to develop a novel approach which can jointly exploit both noisy web training data from test categories and well-labeled training data from auxiliary categories. In particular, on the one hand, we crawl web data for test categories as noisy training data. On the other hand, we transfer the knowledge from auxiliary categories with well-labeled training data to test categories by virtue of free semantic information (e.g., word vector) of all categories. Moreover, given the fact that web data are generally associated with additional textual information (e.g., title and tag), we extend our method by using the surrounding textual information of web data as privileged information. Extensive experiments show the effectiveness of our proposed methods.
Submitted 19 November, 2018;
originally announced November 2018.
-
Deep $k$-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions
Authors:
Junru Wu,
Yue Wang,
Zhenyu Wu,
Zhangyang Wang,
Ashok Veeraraghavan,
Yingyan Lin
Abstract:
The current trend of pushing CNNs deeper with convolutions has created a pressing demand to achieve higher compression gains on CNNs where convolutions dominate the computation and parameter amount (e.g., GoogLeNet, ResNet and Wide ResNet). Further, the high energy consumption of convolutions limits their deployment on mobile devices. To this end, we proposed a simple yet effective scheme for compressing convolutions through applying k-means clustering on the weights: compression is achieved through weight-sharing, by only recording $K$ cluster centers and weight assignment indexes. We then introduced a novel spectrally relaxed $k$-means regularization, which tends to make hard assignments of convolutional layer weights to $K$ learned cluster centers during re-training. We additionally propose an improved set of metrics to estimate energy consumption of CNN hardware implementations, whose estimation results are verified to be consistent with a previously proposed energy estimation tool extrapolated from actual hardware measurements. We finally evaluated Deep $k$-Means across several CNN models in terms of both compression ratio and energy consumption reduction, observing promising results without incurring accuracy loss. The code is available at https://github.com/Sandbox3aster/Deep-K-Means
Submitted 24 June, 2018;
originally announced June 2018.
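The weight-sharing idea in the Deep $k$-Means abstract above (store only $K$ cluster centers plus an assignment index per weight) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scalar weights are clustered with plain Lloyd iterations, omits the spectrally relaxed regularization and re-training, and the names kmeans_1d, compress_weights and decompress_weights are hypothetical.

```python
import numpy as np


def kmeans_1d(values, k, iters=20):
    """Plain Lloyd k-means on scalar values (illustrative assumption only)."""
    # Initialise the centers at evenly spaced quantiles of the data.
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign every value to its nearest center.
        assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centers[j] = members.mean()
    # Final assignment against the updated centers.
    assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return centers, assign


def compress_weights(weights, k):
    """Return (K centers, per-weight index array) for a weight tensor."""
    centers, assign = kmeans_1d(weights.ravel(), k)
    return centers, assign.reshape(weights.shape)


def decompress_weights(centers, assign):
    """Rebuild a dense tensor by looking up each index in the codebook."""
    return centers[assign]


# Hypothetical convolution kernel: 8 filters, 3 channels, 3x3 spatial.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))
centers, idx = compress_weights(w, k=16)
w_hat = decompress_weights(centers, idx)
```

Under these assumptions, the stored model is the 16-entry codebook plus one 4-bit index per weight, instead of one floating-point value per weight.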
-
Signal Processing Based Pile-up Compensation for Gated Single-Photon Avalanche Diodes
Authors:
Adithya K. Pediredla,
Aswin C. Sankaranarayanan,
Mauro Buttafava,
Alberto Tosi,
Ashok Veeraraghavan
Abstract:
Single-photon avalanche diode (SPAD) based transient imaging suffers from an aberration called pile-up. When multiple photons arrive within a single repetition period of the illuminating laser, the SPAD records only the arrival of the first photon; this leads to a bias in the recorded light transient wherein the transient response at later time-instants is underestimated. An unfortunate consequence of this is the need to operate the illumination at low power levels to reduce the probability of multiple photons returning in a single period. Operating the laser at low power results in either low signal-to-noise ratio (SNR) in the measured transients or reduced frame rate due to the longer exposure durations needed to achieve a high SNR. In this paper, we propose a signal processing-based approach to compensate for pile-up in post-processing, thereby enabling high power operation of the illuminating laser. While increasing illumination does cause a fundamental information loss in the data captured by the SPAD, we quantify this information loss using the Cramér-Rao bound and show that the errors in our framework are limited only by this information loss. We experimentally validate our hypotheses using real data from a lab prototype.
Submitted 14 June, 2018;
originally announced June 2018.
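To make the pile-up bias described in the abstract above concrete, the sketch below applies the classical Coates-style histogram correction. This is a standard textbook estimator swapped in for illustration, not the compensation method proposed in the paper; it assumes Poisson photon arrivals, a detector that records at most one photon per laser cycle, and no dead-time or gating effects. The function name coates_correction is hypothetical.

```python
import numpy as np


def coates_correction(hist, n_cycles):
    """Estimate per-bin photon flux from a first-photon histogram.

    hist[i]  : number of laser cycles whose first detection fell in bin i
    n_cycles : total number of laser cycles
    """
    hist = np.asarray(hist, dtype=np.float64)
    # Cycles that already fired in an earlier bin cannot fire in bin i.
    detected_before = np.concatenate(([0.0], np.cumsum(hist)[:-1]))
    remaining = n_cycles - detected_before
    # Conditional detection probability in bin i, given no earlier detection.
    p = np.divide(hist, remaining, out=np.zeros_like(hist), where=remaining > 0)
    p = np.clip(p, 0.0, 1.0 - 1e-12)
    # Invert P(detect) = 1 - exp(-flux) to recover the flux in each bin.
    return -np.log1p(-p)


# Noise-free forward model for a hypothetical four-bin transient.
true_flux = np.array([0.05, 0.8, 0.3, 0.1])
n_cycles = 1_000_000
# Probability that no photon was detected before each bin.
survive = np.exp(-np.concatenate(([0.0], np.cumsum(true_flux)[:-1])))
expected_hist = n_cycles * (1.0 - np.exp(-true_flux)) * survive
recovered = coates_correction(expected_hist, n_cycles)
```

In this noise-free example the naive estimate expected_hist / n_cycles underestimates the later bins, while the corrected estimate recovers true_flux; with real photon counts the correction amplifies noise in late bins, which is consistent with the information loss the abstract refers to.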
-
Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning
Authors:
Wanjia Liu,
Huaijin Chen,
Rishab Goel,
Yuzhong Huang,
Ashok Veeraraghavan,
Ankit Patel
Abstract:
Good temporal representations are crucial for video understanding, and the state-of-the-art video recognition framework is based on two-stream networks. In such a framework, besides the regular ConvNets responsible for RGB frame inputs, a second network is introduced to handle the temporal representation, usually the optical flow (OF). However, OF or other task-oriented flow is computationally costly, and is thus typically pre-computed. Critically, this prevents the two-stream approach from being applied to reinforcement learning (RL) applications such as video game playing, where the next state depends on the current state and action choices. Inspired by the early vision systems of mammals and insects, we propose a fast event-driven representation (EDR) that models several major properties of early retinal circuits: (1) logarithmic input response, (2) multi-timescale temporal smoothing to filter noise, and (3) bipolar (ON/OFF) pathways for primitive event detection [12]. Trading off directional information for speed (> 9000 fps), EDR enables fast real-time inference/learning in video applications that require interaction between an agent and the world, such as game-playing, virtual robotics, and domain adaptation. In this vein, we use EDR to demonstrate performance improvements over state-of-the-art reinforcement learning algorithms for Atari games, something that has not been possible with pre-computed OF. Moreover, with UCF-101 video action recognition experiments, we show that EDR performs near state-of-the-art in accuracy while achieving a 1,500x speedup in input representation processing, as compared to optical flow.
Submitted 19 May, 2018; v1 submitted 16 May, 2018;
originally announced May 2018.
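The three retinal properties listed in the abstract above (logarithmic response, multi-timescale smoothing, ON/OFF pathways) can be sketched as follows. How the paper actually combines them is not stated in the abstract, so this is an assumption: two exponential moving averages with different rates are differenced and the result is split by sign. The function name event_representation and the parameters alphas and eps are hypothetical.

```python
import numpy as np


def event_representation(frames, alphas=(0.5, 0.1), eps=1.0):
    """Map a (T, H, W) intensity video to (T, 2, H, W) ON/OFF event maps."""
    frames = np.asarray(frames, dtype=np.float64)
    # (1) Logarithmic input response; eps avoids log(0).
    log_frames = np.log(frames + eps)
    a_fast, a_slow = alphas
    fast = log_frames[0].copy()
    slow = log_frames[0].copy()
    events = []
    for x in log_frames:
        # (2) Temporal smoothing at two timescales (exponential moving averages).
        fast = a_fast * x + (1.0 - a_fast) * fast
        slow = a_slow * x + (1.0 - a_slow) * slow
        diff = fast - slow
        # (3) Bipolar pathways: ON for brightening, OFF for darkening.
        on = np.maximum(diff, 0.0)
        off = np.maximum(-diff, 0.0)
        events.append(np.stack([on, off]))
    return np.stack(events)


# Hypothetical 3-frame, 2x2 video in which one pixel brightens at t = 1.
video = np.zeros((3, 2, 2))
video[1:, 0, 0] = 100.0
ev = event_representation(video)
```

Every step here is an element-wise operation on the current frame and two running averages, so no optical flow needs to be pre-computed; under these assumptions a static pixel produces no events, and only the brightening pixel activates the ON channel.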
-
Learning from Noisy Web Data with Category-level Supervision
Authors:
Li Niu,
Qingtao Tang,
Ashok Veeraraghavan,
Ashu Sabharwal
Abstract:
As tons of photos are being uploaded to public websites (e.g., Flickr, Bing, and Google) every day, learning from web data, also referred to as webly supervised learning, has become an increasingly popular research direction because of freely available web resources. Nevertheless, the performance gap between webly supervised learning and traditional supervised learning is still very large, owing to the label noise of web data. To be exact, the labels of images crawled from public websites are very noisy and often inaccurate. Some existing works tend to facilitate learning from web data with the aid of extra information, such as augmenting or purifying web data by virtue of instance-level supervision, which usually demands heavy manual annotation. Instead, we propose to tackle the label noise by leveraging more accessible category-level supervision. In particular, we build our method upon a variational autoencoder (VAE), in which the classification network is attached to the hidden layer of the VAE in such a way that the classification network and the VAE can jointly leverage the category-level hybrid semantic information. The effectiveness of our proposed method is clearly demonstrated by extensive experiments on three benchmark datasets.
Submitted 24 May, 2018; v1 submitted 10 March, 2018;
originally announced March 2018.
-
prDeep: Robust Phase Retrieval with a Flexible Deep Network
Authors:
Christopher A. Metzler,
Philip Schniter,
Ashok Veeraraghavan,
Richard G. Baraniuk
Abstract:
Phase retrieval algorithms have become an important component in many modern computational imaging systems. For instance, in the context of ptychography and speckle correlation imaging, they enable imaging past the diffraction limit and through scattering media, respectively. Unfortunately, traditional phase retrieval algorithms struggle in the presence of noise. Progress has been made recently on more robust algorithms using signal priors, but at the expense of limiting the range of supported measurement models (e.g., to Gaussian or coded diffraction patterns). In this work we leverage the regularization-by-denoising framework and a convolutional neural network denoiser to create prDeep, a new phase retrieval algorithm that is both robust and broadly applicable. We test and validate prDeep in simulation to demonstrate that it is robust to noise and can handle a variety of system models.
A MatConvNet implementation of prDeep is available at https://github.com/ricedsp/prDeep.
Submitted 29 June, 2018; v1 submitted 28 February, 2018;
originally announced March 2018.
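As a rough illustration of how the regularization-by-denoising idea named in the abstract above combines with a phase retrieval data term, one generic objective can be written as follows. This is a sketch under stated assumptions, not necessarily the exact formulation used in prDeep: the symbols $y$ (measured amplitudes), $A$ (measurement operator), $D$ (denoiser, here a convolutional neural network), and $\lambda$ (regularization weight) are introduced only for illustration.

```latex
\hat{x} \;=\; \arg\min_{x} \;
\tfrac{1}{2}\,\bigl\| y - |Ax| \bigr\|_2^2
\;+\; \tfrac{\lambda}{2}\, x^{\top}\bigl(x - D(x)\bigr)
```

The first term penalizes disagreement with the phaseless measurements, and the second is the standard regularization-by-denoising penalty, which is small when the denoiser leaves $x$ nearly unchanged. Because the measurement operator enters only through the data term, the same regularizer can in principle be paired with different system models.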
-
Reblur2Deblur: Deblurring Videos via Self-Supervised Learning
Authors:
Huaijin Chen,
Jinwei Gu,
Orazio Gallo,
Ming-Yu Liu,
Ashok Veeraraghavan,
Jan Kautz
Abstract:
Motion blur is a fundamental problem in computer vision as it impacts image quality and hinders inference. Traditional deblurring algorithms leverage the physics of the image formation model and use hand-crafted priors: they usually produce results that better reflect the underlying scene, but exhibit artifacts. Recent learning-based methods implicitly extract the distribution of natural images directly from the data and use it to synthesize plausible images. Their results are impressive, but they are not always faithful to the content of the latent image. We present an approach that bridges the two. Our method fine-tunes existing deblurring neural networks in a self-supervised fashion by enforcing that the output, when blurred based on the optical flow between subsequent frames, matches the input blurry image. We show that our method significantly improves the performance of existing methods on several datasets, both visually and in terms of image quality metrics. The supplementary material is available at https://goo.gl/nYPjEQ.
Submitted 16 January, 2018;
originally announced January 2018.