-
L4P: Low-Level 4D Vision Perception Unified
Authors:
Abhishek Badki,
Hang Su,
Bowen Wen,
Orazio Gallo
Abstract:
The spatio-temporal relationship between the pixels of a video carries critical information for low-level 4D perception tasks. A single model that reasons about it should be able to solve several such tasks well. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand. We present L4P, a feedforward, general-purpose architecture that solves low-level 4D perception tasks in a unified framework. L4P leverages a pre-trained ViT-based video encoder and combines it with per-task heads that are lightweight and therefore do not require extensive training. Despite its general and feedforward formulation, our method matches or surpasses the performance of existing specialized methods on both dense tasks, such as depth or optical flow estimation, and sparse tasks, such as 2D/3D tracking. Moreover, it solves all tasks at once in a time comparable to that of single-task methods.
Submitted 25 April, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Zero-Shot Monocular Scene Flow Estimation in the Wild
Authors:
Yiqing Liang,
Abhishek Badki,
Hang Su,
James Tompkin,
Orazio Gallo
Abstract:
Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for each. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in-the-wild.
Submitted 19 January, 2025; v1 submitted 17 January, 2025;
originally announced January 2025.
-
FoundationStereo: Zero-Shot Stereo Matching
Authors:
Bowen Wen,
Matthew Trepte,
Joseph Aribido,
Jan Kautz,
Orazio Gallo,
Stan Birchfield
Abstract:
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/
Submitted 3 April, 2025; v1 submitted 16 January, 2025;
originally announced January 2025.
-
nvTorchCam: An Open-source Library for Camera-Agnostic Differentiable Geometric Vision
Authors:
Daniel Lichy,
Hang Su,
Abhishek Badki,
Jan Kautz,
Orazio Gallo
Abstract:
We introduce nvTorchCam, an open-source library under the Apache 2.0 license, designed to make deep learning algorithms camera model-independent. nvTorchCam abstracts critical camera operations such as projection and unprojection, allowing developers to implement algorithms once and apply them across diverse camera models--including pinhole, fisheye, and 360 equirectangular panoramas, which are commonly used in automotive and real estate capture applications. Built on PyTorch, nvTorchCam is fully differentiable and supports GPU acceleration and batching for efficient computation. Furthermore, deep learning models trained for one camera type can be directly transferred to other camera types without requiring additional modification. In this paper, we provide an overview of nvTorchCam, its functionality, and present various code examples and diagrams to demonstrate its usage. Source code and installation instructions can be found on the nvTorchCam GitHub page at https://github.com/NVlabs/nvTorchCam.
Submitted 15 October, 2024;
originally announced October 2024.
-
Anonymity and strategy-proofness on a domain of single-peaked and single-dipped preferences
Authors:
Oihane Gallo
Abstract:
We analyze the problem of locating a public facility on a line in a society where agents have either single-peaked or single-dipped preferences. We consider the domain analyzed in Alcalde-Unzu et al. (2024), where the type of preference of each agent is public information, but the location of her peak/dip as well as the rest of the preference are unknown. We characterize all strategy-proof and type-anonymous rules on this domain. Building on existing results, we provide a "two-step characterization": first, the median between the peaks and a collection of fixed values is computed (Moulin, 1980), resulting in either a single alternative or a pair of contiguous alternatives. If the outcome of the median is a pair, we apply a "double-quota majority method" in the second step to choose between the two alternatives in the pair (Moulin, 1983). We also show the additional conditions that type-anonymity imposes on the strategy-proof rules characterized by Alcalde-Unzu et al. (2024). Finally, we show the equivalence between the two characterizations.
Submitted 4 October, 2024;
originally announced October 2024.
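The first step mentioned in the abstract above, a median taken over the reported peaks and a collection of fixed values, can be illustrated with a minimal sketch of the classical generalized median rule in the spirit of Moulin (1980). This is only an illustration under stated assumptions: the function name `generalized_median`, the numeric encoding of alternatives, and the requirement that the total number of values be odd are choices made here, not details taken from the paper. It does not model the case where the outcome is a pair of contiguous alternatives, nor the second-step majority method.

```python
def generalized_median(peaks, fixed_values):
    """Return the median of the reported peaks together with fixed values.

    Hypothetical helper for illustration: alternatives are plain numbers,
    and the combined list must have odd length so the median is unique.
    """
    values = sorted(list(peaks) + list(fixed_values))
    if len(values) % 2 == 0:
        raise ValueError("peaks plus fixed values must have odd total length")
    return values[len(values) // 2]


# Three agents report peaks; two fixed values sit at the ends of the line.
outcome_a = generalized_median([0.2, 0.9, 0.5], [0.0, 1.0])

# Moving both fixed values to the left end pulls the outcome leftward.
outcome_b = generalized_median([0.2, 0.9, 0.5], [0.0, 0.0])
```

Because the outcome is an order statistic of the combined list, an agent who misreports her peak can only move the outcome away from her true peak or leave it unchanged, which is the intuition behind the strategy-proofness of median rules.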
-
FoVA-Depth: Field-of-View Agnostic Depth Estimation for Cross-Dataset Generalization
Authors:
Daniel Lichy,
Hang Su,
Abhishek Badki,
Jan Kautz,
Orazio Gallo
Abstract:
Wide field-of-view (FoV) cameras efficiently capture large portions of the scene, which makes them attractive in multiple domains, such as automotive and robotics. For such applications, estimating depth from multiple images is a critical task, and therefore, a large amount of ground truth (GT) data is available. Unfortunately, most of the GT data is for pinhole cameras, making it impossible to properly train depth estimation models for large-FoV cameras. We propose the first method to train a stereo depth estimation model on the widely available pinhole data, and to generalize it to data captured with larger FoVs. Our intuition is simple: We warp the training data to a canonical, large-FoV representation and augment it to allow a single network to reason about diverse types of distortions that otherwise would prevent generalization. We show strong generalization ability of our approach on both indoor and outdoor datasets, which was not possible with previous methods.
Submitted 24 January, 2024;
originally announced January 2024.
-
Stable partitions for proportional generalized claims problems
Authors:
Oihane Gallo,
Bettina Klaus
Abstract:
We consider a set of agents who have claims on an endowment that is not large enough to cover all claims. Agents can form coalitions but a minimal coalition size $θ$ is required to have positive coalitional funding that is proportional to the sum of the claims of its members. We analyze the structure of stable partitions when coalition members use well-behaved rules to allocate coalitional endowments, e.g., the well-known constrained equal awards rule (CEA) or the constrained equal losses rule (CEL). For continuous, (strictly) resource monotonic, and consistent rules, stable partitions with (mostly) $θ$-size coalitions emerge. For CEA and CEL we provide algorithms to construct such a stable partition formed by $θ$-size coalitions.
Submitted 29 August, 2024; v1 submitted 7 November, 2023;
originally announced November 2023.
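For readers unfamiliar with the constrained equal awards rule (CEA) named in the abstract above, the following is a minimal sketch of its standard textbook definition: every agent receives the same amount, capped at her own claim, with the common amount chosen so that the endowment is exhausted. The function name `constrained_equal_awards` and the list-based interface are assumptions made for illustration. The sketch covers only the rule for a single coalition, not the coalition-formation model or the stable-partition algorithms of the paper.

```python
def constrained_equal_awards(claims, endowment):
    """Split `endowment` among claimants by the CEA rule.

    Each agent i receives min(claims[i], lam), where lam is the common
    award level at which the awards sum to the endowment.
    """
    if endowment < 0 or any(c < 0 for c in claims):
        raise ValueError("claims and endowment must be non-negative")

    awards = [0.0] * len(claims)
    remaining = endowment
    # Serve agents from the smallest claim upward: an agent whose claim is
    # below the current equal share is paid in full, and the leftover is
    # shared equally among the agents still to be served.
    order = sorted(range(len(claims)), key=lambda i: claims[i])
    for position, i in enumerate(order):
        equal_share = remaining / (len(claims) - position)
        awards[i] = min(claims[i], equal_share)
        remaining -= awards[i]
    return awards


# Claims of 5, 20, and 30 on an endowment of 30.
awards = constrained_equal_awards([5, 20, 30], 30)
```

In this example the smallest claimant is paid in full and the other two split the remainder equally, so small claims are protected while large claims bear the shortfall.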
-
Zero-shot Pose Transfer for Unrigged Stylized 3D Characters
Authors:
Jiashun Wang,
Xueting Li,
Sifei Liu,
Shalini De Mello,
Orazio Gallo,
Xiaolong Wang,
Jan Kautz
Abstract:
Transferring the pose of a reference avatar to stylized 3D characters of various shapes is a fundamental task in computer graphics. Existing methods either require the stylized characters to be rigged, or they use the stylized character in the desired pose as ground truth at training. We present a zero-shot approach that requires only the widely available deformed non-stylized avatars in training, and deforms stylized characters of significantly different shapes at inference. Classical methods achieve strong generalization by deforming the mesh at the triangle level, but this requires labelled correspondences. We leverage the power of local deformation, but without requiring explicit correspondence labels. We introduce a semi-supervised shape-understanding module to bypass the need for explicit correspondences at test time, and an implicit pose deformation module that deforms individual surface points to match the target pose. Furthermore, to encourage realistic and accurate deformation of stylized characters, we introduce an efficient volume-based test-time training procedure. Because it does not need rigging, nor the deformed stylized character at training time, our model generalizes to categories with scarce annotation, such as stylized quadrupeds. Extensive experiments demonstrate the effectiveness of the proposed method compared to the state-of-the-art approaches trained with comparable or more supervision. Our project page is available at https://jiashunwang.github.io/ZPT
Submitted 31 May, 2023;
originally announced June 2023.
-
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos
Authors:
Ekta Prashnani,
Koki Nagano,
Shalini De Mello,
David Luebke,
Orazio Gallo
Abstract:
Modern avatar generators allow anyone to synthesize photorealistic real-time talking avatars, ushering in a new era of avatar-based human communication, such as with immersive AR/VR interactions or videoconferencing with limited bandwidths. Their safe adoption, however, requires a mechanism to verify if the rendered avatar is trustworthy: does it use the appearance of an individual without their consent? We term this task avatar fingerprinting. To tackle it, we first introduce a large-scale dataset of real and synthetic videos of people interacting on a video call, where the synthetic videos are generated using the facial appearance of one person and the expressions of another. We verify the identity driving the expressions in a synthetic video, by learning motion signatures that are independent of the facial appearance shown. Our solution, the first in this space, achieves an average AUC of 0.85. Critical to its practical use, it also generalizes to new generators never seen in training (average AUC of 0.83). The proposed dataset and other resources can be found at: https://research.nvidia.com/labs/nxp/avatar-fingerprinting/.
Submitted 4 August, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Strategy-proofness with single-peaked and single-dipped preferences
Authors:
Jorge Alcalde-Unzu,
Oihane Gallo,
Marc Vorsatz
Abstract:
We analyze the problem of locating a public facility in a domain of single-peaked and single-dipped preferences when the social planner knows the type of preference (single-peaked or single-dipped) of each agent. Our main result characterizes all strategy-proof rules and shows that they can be decomposed into two steps. In the first step, the agents with single-peaked preferences are asked about their peaks and, for each profile of reported peaks, at most two alternatives are preselected. In the second step, the agents with single-dipped preferences are asked to reveal their dips to complete the decision between the preselected alternatives. Our result generalizes the findings of Moulin (1980) and Barberà and Jackson (1994) for single-peaked and of Manjunath (2014) for single-dipped preferences. Finally, we show that all strategy-proof rules are also group strategy-proof and analyze the implications of Pareto efficiency.
Submitted 30 March, 2024; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Solidarity to achieve stability
Authors:
Jorge Alcalde-Unzu,
Oihane Gallo,
Elena Inarra,
Juan D. Moreno-Ternero
Abstract:
Agents may form coalitions. Each coalition shares its endowment among its agents by applying a sharing rule. The sharing rule induces a coalition formation problem by assuming that agents rank coalitions according to the allocation they obtain in the corresponding sharing problem. We characterize the sharing rules that induce a class of stable coalition formation problems as those that satisfy a natural axiom that formalizes the principle of solidarity. Thus, solidarity becomes a sufficient condition to achieve stability.
Submitted 5 October, 2023; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects
Authors:
Atsuhiro Noguchi,
Umar Iqbal,
Jonathan Tremblay,
Tatsuya Harada,
Orazio Gallo
Abstract:
Rendering articulated objects while controlling their poses is critical to applications such as virtual reality or animation for movies. Manipulating the pose of an object, however, requires the understanding of its underlying structure, that is, its joints and how they interact with each other. Unfortunately, assuming the structure to be known, as existing methods do, precludes the ability to work on new object categories. We propose to learn both the appearance and the structure of previously unseen articulated objects by observing them move from multiple views, with no joints annotation supervision, or information about the structure. We observe that 3D points that are static relative to one another should belong to the same part, and that adjacent parts that move relative to each other must be connected by a joint. To leverage this insight, we model the object parts in 3D as ellipsoids, which allows us to identify joints. We combine this explicit representation with an implicit one that compensates for the approximation introduced. We show that our method works for different structures, from quadrupeds, to single-arm robots, to humans.
Submitted 6 April, 2022; v1 submitted 21 December, 2021;
originally announced December 2021.
-
Efficient Geometry-aware 3D Generative Adversarial Networks
Authors:
Eric R. Chan,
Connor Z. Lin,
Matthew A. Chan,
Koki Nagano,
Boxiao Pan,
Shalini De Mello,
Orazio Gallo,
Leonidas Guibas,
Jonathan Tremblay,
Sameh Khamis,
Tero Karras,
Gordon Wetzstein
Abstract:
Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.
Submitted 27 April, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Real-time ground filtering algorithm of cloud points acquired using Terrestrial Laser Scanner (TLS)
Authors:
Nelson Diaz,
Omar Gallo,
Jhon Caceres,
Hernan Porras
Abstract:
3D modeling based on point clouds requires ground-filtering algorithms that separate ground from non-ground objects. This study presents two ground-filtering algorithms. The first one is based on normal vectors. It has two variants depending on the procedure to compute the k-nearest neighbors. The second algorithm is based on transforming the cloud points into a voxel structure. To evaluate them, the two algorithms are compared according to their execution time, effectiveness, and efficiency. Results show that the ground-filtering algorithm based on the voxel structure outperforms the normal-vector ground filtering in terms of execution time, effectiveness, and efficiency.
Submitted 22 November, 2021;
originally announced November 2021.
-
Neural Trajectory Fields for Dynamic Novel View Synthesis
Authors:
Chaoyang Wang,
Ben Eckart,
Simon Lucey,
Orazio Gallo
Abstract:
Recent approaches to render photorealistic views from a limited set of photographs have pushed the boundaries of our interactions with pictures of static scenes. The ability to recreate moments, that is, time-varying sequences, is perhaps an even more interesting scenario, but it remains largely unsolved. We introduce DCT-NeRF, a coordinate-based neural representation for dynamic scenes. DCT-NeRF learns smooth and stable trajectories over the input sequence for each point in space. This allows us to enforce consistency between any two frames in the sequence, which results in high-quality reconstruction, particularly in dynamic regions.
Submitted 12 May, 2021;
originally announced May 2021.
-
Noise-Aware Video Saliency Prediction
Authors:
Ekta Prashnani,
Orazio Gallo,
Joohwan Kim,
Josef Spjut,
Pradeep Sen,
Iuri Frosio
Abstract:
We tackle the problem of predicting saliency maps for videos of dynamic scenes. We note that the accuracy of the maps reconstructed from the gaze data of a fixed number of observers varies with the frame, as it depends on the content of the scene. This issue is particularly pressing when a limited number of observers are available. In such cases, directly minimizing the discrepancy between the predicted and measured saliency maps, as traditional deep-learning methods do, results in overfitting to the noisy data. We propose a noise-aware training (NAT) paradigm that quantifies and accounts for the uncertainty arising from frame-specific gaze data inaccuracy. We show that NAT is especially advantageous when limited training data is available, with experiments across different models, loss functions, and datasets. We also introduce a video game-based saliency dataset, with rich temporal semantics, and multiple gaze attractors per frame. The dataset and source code are available at https://github.com/NVlabs/NAT-saliency.
Submitted 22 November, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Binary TTC: A Temporal Geofence for Autonomous Navigation
Authors:
Abhishek Badki,
Orazio Gallo,
Jan Kautz,
Pradeep Sen
Abstract:
Time-to-contact (TTC), the time for an object to collide with the observer's plane, is a powerful tool for path planning: it is potentially more informative than the depth, velocity, and acceleration of objects in the scene -- even for humans. TTC presents several advantages, including requiring only a monocular, uncalibrated camera. However, regressing TTC for each pixel is not straightforward, and most existing methods make over-simplifying assumptions about the scene. We address this challenge by estimating TTC via a series of simpler, binary classifications. We predict with low latency whether the observer will collide with an obstacle within a certain time, which is often more critical than knowing exact, per-pixel TTC. For such scenarios, our method offers a temporal geofence in 6.4 ms -- over 25x faster than existing methods. Our approach can also estimate per-pixel TTC with arbitrarily fine quantization (including continuous values), when the computational budget allows for it. To the best of our knowledge, our method is the first to offer TTC information (binary or coarsely quantized) at sufficiently high frame-rates for practical use.
Submitted 28 April, 2021; v1 submitted 12 January, 2021;
originally announced January 2021.
-
Generative View Synthesis: From Single-view Semantics to Novel-view Images
Authors:
Tewodros Habtegebrial,
Varun Jampani,
Orazio Gallo,
Didier Stricker
Abstract:
Content creation, central to applications such as virtual reality, can be tedious and time-consuming. Recent image synthesis methods simplify this task by offering tools to generate new views from as little as a single input image, or by converting a semantic map into a photorealistic image. We propose to push the envelope further, and introduce Generative View Synthesis (GVS), which can synthesize multiple photorealistic views of a scene given a single semantic map. We show that the sequential application of existing techniques, e.g., semantics-to-image translation followed by monocular view synthesis, fails at capturing the scene's structure. In contrast, we solve the semantics-to-image translation in concert with the estimation of the 3D layout of the scene, thus producing geometrically consistent novel views that preserve semantic structures. We first lift the input 2D semantic map onto a 3D layered representation of the scene in feature space, thereby preserving the semantic labels of 3D geometric structures. We then project the layered features onto the target views to generate the final novel-view images. We verify the strengths of our method and compare it with several advanced baselines on three different datasets. Our approach also allows for style manipulation and image editing operations, such as the addition or removal of objects, with simple manipulations of the input style images and semantic maps, respectively. Visit the project page at https://gvsnet.github.io.
Submitted 2 October, 2020; v1 submitted 20 August, 2020;
originally announced August 2020.
-
Bi3D: Stereo Depth Estimation via Binary Classifications
Authors:
Abhishek Badki,
Alejandro Troccoli,
Kihwan Kim,
Jan Kautz,
Pradeep Sen,
Orazio Gallo
Abstract:
Stereo-based depth estimation is a cornerstone of computer vision, with state-of-the-art methods delivering accurate results in real time. For several applications such as autonomous navigation, however, it may be useful to trade accuracy for lower latency. We present Bi3D, a method that estimates depth via a series of binary classifications. Rather than testing if objects are at a particular depth $D$, as existing stereo methods do, it classifies them as being closer or farther than $D$. This property offers a powerful mechanism to balance accuracy and latency. Given a strict time budget, Bi3D can detect objects closer than a given distance in as little as a few milliseconds, or estimate depth with arbitrarily coarse quantization, with complexity linear with the number of quantization levels. Bi3D can also use the allotted quantization levels to get continuous depth, but in a specific depth range. For standard stereo (i.e., continuous depth on the whole range), our method is close to or on par with state-of-the-art, finely tuned stereo methods.
Submitted 1 June, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
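The idea in the Bi3D abstract above, classifying each pixel as closer or farther than a depth $D$ and repeating this for several depths to obtain a quantized depth map, can be sketched in a few lines. This is a toy illustration, not the Bi3D network: the hypothetical `closer_than` function stands in for the learned binary classifier by thresholding a known depth value, and `quantize_depth` and the nested-list depth map are names and formats assumed here.

```python
def closer_than(depth, plane):
    """Toy stand-in for a learned binary classifier (assumption):
    report whether a pixel's depth lies in front of the plane."""
    return depth < plane


def quantize_depth(depth_map, planes):
    """Assign each pixel a bin index using one binary test per plane.

    The bin index is the number of planes the pixel is NOT closer than,
    so the cost grows linearly with the number of quantization levels.
    """
    bins = [[0 for _ in row] for row in depth_map]
    for plane in planes:  # one binary classification pass per plane
        for r, row in enumerate(depth_map):
            for c, depth in enumerate(row):
                if not closer_than(depth, plane):
                    bins[r][c] += 1
    return bins


# A 2x2 depth map (arbitrary units) tested against three planes.
depth_map = [[0.5, 2.0], [3.5, 10.0]]
bins = quantize_depth(depth_map, planes=[1.0, 3.0, 5.0])

# A single plane answers only "is anything closer than 1.0?".
near_mask = [[closer_than(d, 1.0) for d in row] for row in depth_map]
```

Using a single plane yields the cheapest query (a near/far mask), while adding planes refines the quantization at proportional cost, which mirrors the accuracy/latency trade-off the abstract describes.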
-
Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera
Authors:
Jae Shin Yoon,
Kihwan Kim,
Orazio Gallo,
Hyun Soo Park,
Jan Kautz
Abstract:
This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show the outstanding performance over existing methods.
Submitted 2 April, 2020;
originally announced April 2020.
-
Meshlet Priors for 3D Mesh Reconstruction
Authors:
Abhishek Badki,
Orazio Gallo,
Jan Kautz,
Pradeep Sen
Abstract:
Estimating a mesh from an unordered set of sparse, noisy 3D points is a challenging problem that requires carefully selected priors. Existing hand-crafted priors, such as smoothness regularizers, impose an undesirable trade-off between attenuating noise and preserving local detail. Recent deep-learning approaches produce impressive results by learning priors directly from the data. However, the priors are learned at the object level, which makes these algorithms class-specific and even sensitive to the pose of the object. We introduce meshlets, small patches of mesh that we use to learn local shape priors. Meshlets act as a dictionary of local features and thus allow learned priors to be used to reconstruct object meshes in any pose and from unseen classes, even when the noise is large and the samples sparse.
Submitted 1 June, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
Video Stitching for Linear Camera Arrays
Authors:
Wei-Sheng Lai,
Orazio Gallo,
Jinwei Gu,
Deqing Sun,
Ming-Hsuan Yang,
Jan Kautz
Abstract:
Despite the long history of image and video stitching research, existing academic and commercial solutions still produce strong artifacts. In this work, we propose a wide-baseline video stitching algorithm for linear camera arrays that is temporally stable and tolerant to strong parallax. Our key insight is that stitching can be cast as a problem of learning a smooth spatial interpolation between the input videos. To solve this problem, inspired by pushbroom cameras, we introduce a fast pushbroom interpolation layer and propose a novel pushbroom stitching network, which learns a dense flow field to smoothly align the multiple input videos for spatial interpolation. Our approach outperforms the state-of-the-art by a significant margin, as we show with a user study, and has immediate applications in many areas such as virtual reality, immersive telepresence, autonomous driving, and video surveillance.
Submitted 31 July, 2019;
originally announced July 2019.
-
Pixel-Adaptive Convolutional Neural Networks
Authors:
Hang Su,
Varun Jampani,
Deqing Sun,
Orazio Gallo,
Erik Learned-Miller,
Jan Kautz
Abstract:
Convolutions are the fundamental building block of CNNs. The fact that their weights are spatially shared is one of the main reasons for their widespread use, but it also is a major limitation, as it makes convolutions content agnostic. We propose a pixel-adaptive convolution (PAC) operation, a simple yet effective modification of standard convolutions, in which the filter weights are multiplied with a spatially-varying kernel that depends on learnable, local pixel features. PAC is a generalization of several popular filtering techniques and thus can be used for a wide range of use cases. Specifically, we demonstrate state-of-the-art performance when PAC is used for deep joint image upsampling. PAC also offers an effective alternative to fully-connected CRF (Full-CRF), called PAC-CRF, which performs competitively, while being considerably faster. In addition, we also demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements.
Submitted 10 April, 2019;
originally announced April 2019.
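
The pixel-adaptive convolution described in the abstract above can be sketched in one dimension. This is an illustrative sketch, not the authors' code: the Gaussian form of the adapting kernel, the zero padding at the borders, and the function name `pac_1d` are assumptions made here. Each shared filter weight is multiplied by a factor that depends on how similar the features at the two positions are.

```python
import math
from typing import List


def pac_1d(values: List[float], features: List[float],
           weights: List[float]) -> List[float]:
    """1D pixel-adaptive filtering with zero padding.

    out[i] = sum_j K(f_i, f_j) * w[j - i] * v[j], where K is assumed
    to be a Gaussian on the feature difference.
    """
    radius = len(weights) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            j = i + k - radius
            if 0 <= j < n:
                # Adapting kernel: 1 for identical features, near 0 across edges.
                adapt = math.exp(-0.5 * (features[i] - features[j]) ** 2)
                acc += adapt * w * values[j]
        out.append(acc)
    return out


values = [1.0, 2.0, 3.0]
box = [1.0, 1.0, 1.0]

# Constant features: the adapting kernel is 1 everywhere, so this reduces
# to an ordinary (content-agnostic) convolution.
flat = pac_1d(values, [0.0, 0.0, 0.0], box)

# A large feature jump before the last sample: contributions across the
# jump are suppressed.
edge = pac_1d(values, [0.0, 0.0, 100.0], box)
```

With constant features the result matches a standard convolution, which is the sense in which the operation generalizes it.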
-
Extreme View Synthesis
Authors:
Inchang Choi,
Orazio Gallo,
Alejandro Troccoli,
Min H. Kim,
Jan Kautz
Abstract:
We present Extreme View Synthesis, a solution for novel view extrapolation that works even when the number of input images is small--as few as two. In this context, occlusions and depth uncertainty are two of the most pressing issues, and worsen as the degree of extrapolation increases. We follow the traditional paradigm of performing depth-based warping and refinement, with a few key improvements. First, we estimate a depth probability volume, rather than just a single depth value for each pixel of the novel view. This allows us to leverage depth uncertainty in challenging regions, such as depth discontinuities. After using it to get an initial estimate of the novel view, we explicitly combine learned image priors and the depth uncertainty to synthesize a refined image with fewer artifacts. Our method is the first to show visually pleasing results for baseline magnifications of up to 30X.
Submitted 29 August, 2019; v1 submitted 11 December, 2018;
originally announced December 2018.
-
A Fusion Approach for Multi-Frame Optical Flow Estimation
Authors:
Zhile Ren,
Orazio Gallo,
Deqing Sun,
Ming-Hsuan Yang,
Erik B. Sudderth,
Jan Kautz
Abstract:
To date, top-performing optical flow estimation methods only take pairs of consecutive frames into account. While elegant and appealing, the idea of using more than two frames has not yet produced state-of-the-art results. We present a simple, yet effective fusion approach for multi-frame optical flow that benefits from longer-term temporal cues. Our method first warps the optical flow from previous frames to the current, thereby yielding multiple plausible estimates. It then fuses the complementary information carried by these estimates into a new optical flow field. At the time of writing, our method ranks first among published results in the MPI Sintel and KITTI 2015 benchmarks. Our models will be available on https://github.com/NVlabs/PWC-Net.
Submitted 29 November, 2018; v1 submitted 23 October, 2018;
originally announced October 2018.
-
Tackling 3D ToF Artifacts Through Learning and the FLAT Dataset
Authors:
Qi Guo,
Iuri Frosio,
Orazio Gallo,
Todd Zickler,
Jan Kautz
Abstract:
Scene motion, multiple reflections, and sensor noise introduce artifacts in the depth reconstruction performed by time-of-flight cameras. We propose a two-stage, deep-learning approach to address all of these sources of artifacts simultaneously. We also introduce FLAT, a synthetic dataset of 2000 ToF measurements that capture all of these nonidealities; the dataset also allows simulating different camera hardware. Using the Kinect 2 camera as a baseline, we show improved reconstruction errors over state-of-the-art methods, on both simulated and real data.
Submitted 26 July, 2018;
originally announced July 2018.
-
Reblur2Deblur: Deblurring Videos via Self-Supervised Learning
Authors:
Huaijin Chen,
Jinwei Gu,
Orazio Gallo,
Ming-Yu Liu,
Ashok Veeraraghavan,
Jan Kautz
Abstract:
Motion blur is a fundamental problem in computer vision as it impacts image quality and hinders inference. Traditional deblurring algorithms leverage the physics of the image formation model and use hand-crafted priors: they usually produce results that better reflect the underlying scene, but present artifacts. Recent learning-based methods implicitly extract the distribution of natural images directly from the data and use it to synthesize plausible images. Their results are impressive, but they are not always faithful to the content of the latent image. We present an approach that bridges the two. Our method fine-tunes existing deblurring neural networks in a self-supervised fashion by enforcing that the output, when blurred based on the optical flow between subsequent frames, matches the input blurry image. We show that our method significantly improves the performance of existing methods on several datasets both visually and in terms of image quality metrics. The supplementary material is available at https://goo.gl/nYPjEQ
Submitted 16 January, 2018;
originally announced January 2018.
-
Separating Reflection and Transmission Images in the Wild
Authors:
Patrick Wieschollek,
Orazio Gallo,
Jinwei Gu,
Jan Kautz
Abstract:
The reflections caused by common semi-reflectors, such as glass windows, can impact the performance of computer vision algorithms. State-of-the-art methods can remove reflections on synthetic data and in controlled scenarios. However, they are based on strong assumptions and do not generalize well to real-world images. Contrary to a common misconception, real-world images are challenging even when polarization information is used. We present a deep learning approach to separate the reflected and the transmitted components of the recorded irradiance, which explicitly uses the polarization properties of light. To train it, we introduce an accurate synthetic data generation pipeline, which simulates realistic reflections, including those generated by curved and non-ideal surfaces, non-static scenes, and high-dynamic-range scenes.
Submitted 16 August, 2018; v1 submitted 6 December, 2017;
originally announced December 2017.
-
Deep Learning with Energy-efficient Binary Gradient Cameras
Authors:
Suren Jayasuriya,
Orazio Gallo,
Jinwei Gu,
Jan Kautz
Abstract:
Power consumption is a critical factor for the deployment of embedded computer vision systems. We explore the use of computational cameras that directly output binary gradient images to reduce the portion of the power consumption allocated to image sensing. We survey the accuracy of binary gradient cameras on a number of computer vision tasks using deep learning. These include object recognition, head pose regression, face detection, and gesture recognition. We show that, for certain applications, accuracy can be on par with or even better than what can be achieved on traditional images. We are also the first to recover intensity information from binary spatial gradient images--useful for applications with a human observer in the loop, such as surveillance. Our results, which we validate with a prototype binary gradient camera, point to the potential of gradient-based computer vision systems.
Submitted 3 December, 2016;
originally announced December 2016.
-
Loss Functions for Neural Networks for Image Processing
Authors:
Hang Zhao,
Orazio Gallo,
Iuri Frosio,
Jan Kautz
Abstract:
Neural networks are becoming central in several areas of computer vision and image processing and different architectures have been proposed to solve specific problems. The impact of the loss layer of neural networks, however, has not received much attention in the context of image processing: the default and virtually only choice is L2. In this paper, we bring attention to alternative choices for image restoration. In particular, we show the importance of perceptually-motivated losses when the resulting image is to be evaluated by a human observer. We compare the performance of several losses, and propose a novel, differentiable error function. We show that the quality of the results improves significantly with better loss functions, even when the network architecture is left unchanged.
Submitted 20 April, 2018; v1 submitted 27 November, 2015;
originally announced November 2015.
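
To make the point about the default L2 loss in the abstract above concrete, the following minimal sketch compares L2 with one simple alternative, L1, on per-pixel errors. The choice of L1 as the example and the function names are assumptions made here for illustration; the abstract itself does not list the alternatives it studies, and this is not the paper's proposed error function.

```python
from typing import Sequence


def l2_loss(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error: large residuals dominate the average."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def l1_loss(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error: each residual contributes in proportion to its size."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(target)


# Two pixels with residuals of 1 and 3 intensity levels.
prediction = [0.0, 0.0]
target = [1.0, 3.0]

mse = l2_loss(prediction, target)   # (1 + 9) / 2
mae = l1_loss(prediction, target)   # (1 + 3) / 2
```

Under L2 the larger residual accounts for 90% of the loss, while under L1 it accounts for 75%, so swapping the loss changes which errors the network is pushed to reduce even when the architecture is unchanged.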
-
Locally Non-rigid Registration for Mobile HDR Photography
Authors:
Orazio Gallo,
Alejandro Troccoli,
Jun Hu,
Kari Pulli,
Jan Kautz
Abstract:
Image registration for stack-based HDR photography is challenging. If not properly accounted for, camera motion and scene changes result in artifacts in the composite image. Unfortunately, existing methods to address this problem are either accurate, but too slow for mobile devices, or fast, but prone to failing. We propose a method that fills this void: our approach is extremely fast---under 700ms on a commercial tablet for a pair of 5MP images---and prevents the artifacts that arise from insufficient registration quality.
Submitted 4 May, 2015; v1 submitted 6 April, 2015;
originally announced April 2015.