Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation

Rajeswar, Sai; Mannan, Fahim; Golemo, Florian; Parent-Lévesque, Jérôme; Vazquez, David; Nowrouzezahrai, Derek; Courville, Aaron

doi:10.1007/s11263-020-01322-1

arXiv:2003.14166 (cs)

[Submitted on 23 Mar 2020 (v1), last revised 17 Apr 2020 (this version, v2)]

Abstract: We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground-truth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, e.g., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
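The view-based 2.5D surfel representation and the rendering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the pinhole camera model, the 60-degree field of view, the Lambertian shading with a single directional light, and all function names (`unproject`, `build_surfels`, `shade`) are assumptions made for illustration only. The sketch shows one surfel (a position plus a unit normal) per pixel, so the size of the representation follows the on-screen resolution rather than a world-space grid.

```python
import math


def unproject(u, v, depth, width, height, fov_deg=60.0):
    """Back-project pixel (u, v) at the given depth to a camera-space point.

    Assumes a pinhole camera looking along +z (an illustrative choice).
    """
    focal = 0.5 * width / math.tan(math.radians(fov_deg) / 2.0)  # in pixels
    x = (u + 0.5 - width / 2.0) * depth / focal
    y = (v + 0.5 - height / 2.0) * depth / focal
    return (x, y, depth)


def normalize(vec):
    """Return vec scaled to unit length."""
    length = math.sqrt(sum(c * c for c in vec))
    return tuple(c / length for c in vec)


def build_surfels(depth_map, normal_map, fov_deg=60.0):
    """Create one (position, normal) surfel per pixel of the view."""
    height, width = len(depth_map), len(depth_map[0])
    surfels = []
    for v in range(height):
        for u in range(width):
            position = unproject(u, v, depth_map[v][u], width, height, fov_deg)
            surfels.append((position, normalize(normal_map[v][u])))
    return surfels


def shade(surfels, width, light_dir, albedo=1.0):
    """Lambertian shading: intensity = albedo * max(0, n . l) per surfel.

    light_dir points from the surface toward the light (assumed convention).
    Returns the image as a list of rows.
    """
    light = normalize(light_dir)
    values = [
        albedo * max(0.0, sum(n * l for n, l in zip(normal, light)))
        for _, normal in surfels
    ]
    return [values[i:i + width] for i in range(0, len(values), width)]


# A 4x3 view of a flat wall two units away, facing the camera.
depth = [[2.0] * 4 for _ in range(3)]
normals = [[(0.0, 0.0, -1.0)] * 4 for _ in range(3)]
surfels = build_surfels(depth, normals)
image = shade(surfels, width=4, light_dir=(0.0, 0.0, -1.0))
```

In the approach described in the abstract, the per-pixel surfel attributes would come from the decoder rather than from hand-written maps, and the renderer would need to be differentiable so that the critic's signal can reach the encoder and decoder. The plain-Python shading above only mirrors the forward pass of such a renderer.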

Comments:	This is a pre-print of an article published in International Journal of Computer Vision. The final authenticated version is available online at: https://doi.org/10.1007/s11263-020-01322-1
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2003.14166 [cs.CV]
	(or arXiv:2003.14166v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2003.14166
Journal reference:	International Journal of Computer Vision, (2020), 1-16
Related DOI:	https://doi.org/10.1007/s11263-020-01322-1

Submission history

From: Florian Golemo
[v1] Mon, 23 Mar 2020 03:01:34 UTC (9,716 KB)
[v2] Fri, 17 Apr 2020 13:22:58 UTC (9,716 KB)
