Showing 1–50 of 66 results for author: Chang, A X

  1. arXiv:2503.16848  [pdf, other]

    cs.GR cs.CV

    HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation

    Authors: Hou In Derek Pun, Hou In Ivan Tam, Austin T. Wang, Xiaoliang Huo, Angel X. Chang, Manolis Savva

    Abstract: Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods primarily focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Those that place small objects typically do not honor arrangement specifications, resulting in largely random placement not following the text d…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 23 pages, 7 figures

  2. arXiv:2503.16375  [pdf, other]

    cs.CV

    NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

    Authors: Han-Hung Lee, Qinghong Han, Angel X. Chang

    Abstract: In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient a…

    Submitted 20 March, 2025; originally announced March 2025.

  3. arXiv:2503.14756  [pdf, ps, other]

    cs.GR cs.CV

    SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

    Authors: Hou In Ivan Tam, Hou In Derek Pun, Austin T. Wang, Angel X. Chang, Manolis Savva

    Abstract: Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text - a critical factor in determining how effectively a method meets user requirements. We present SceneEval, an…

    Submitted 11 June, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: Expanded dataset to 500 annotated scene descriptions with new scene types; added validation via extended manual evaluation and a new user study; clarified distinctions from prior metrics; included results using an open-source VLM; stated intent to release code and data; corrected terminology and typos. 24 pages with 8 figures and 6 tables

  4. arXiv:2503.04496  [pdf, other]

    cs.GR cs.CV cs.LG

    Learning Object Placement Programs for Indoor Scene Synthesis with Iterative Self Training

    Authors: Adrian Chang, Kai Wang, Yuanbo Li, Manolis Savva, Angel X. Chang, Daniel Ritchie

    Abstract: Data driven and autoregressive indoor scene synthesis systems generate indoor scenes automatically by suggesting and then placing objects one at a time. Empirical observations show that current systems tend to produce incomplete next object location distributions. We introduce a system which addresses this problem. We design a Domain Specific Language (DSL) that specifies functional constraints. P…

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: 21 pages, 20 figures Subjects: Graphics (cs.GR), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG)

    ACM Class: I.3.6

  5. arXiv:2502.18405  [pdf, other]

    cs.LG

    Enhancing DNA Foundation Models to Address Masking Inefficiencies

    Authors: Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor

    Abstract: Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] is absent during downstrea…

    Submitted 25 February, 2025; originally announced February 2025.

    Comments: 10 pages, 5 figures

  6. arXiv:2501.05750  [pdf, other]

    cs.RO cs.CV

    Semantic Mapping in Indoor Embodied AI -- A Comprehensive Survey and Future Directions

    Authors: Sonia Raychaudhuri, Angel X. Chang

    Abstract: Intelligent embodied agents (e.g. robots) need to perform complex semantic tasks in unfamiliar environments. Among many skills that the agents need to possess, building and maintaining a semantic map of the environment is most crucial in long-horizon tasks. A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning throu…

    Submitted 10 January, 2025; originally announced January 2025.

  7. arXiv:2501.01366  [pdf, other]

    cs.CV cs.AI cs.CL

    ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

    Authors: Austin T. Wang, ZeMing Gong, Angel X. Chang

    Abstract: 3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential promp…

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: 20 pages with 5 figures and 11 tables

  8. arXiv:2411.19492  [pdf, other]

    cs.CV cs.LG

    Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

    Authors: Qirui Wu, Denys Iliash, Daniel Ritchie, Manolis Savva, Angel X. Chang

    Abstract: Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present…

    Submitted 14 March, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

  9. arXiv:2411.07848  [pdf, other]

    cs.RO cs.CV

    Zero-shot Object-Centric Instruction Following: Integrating Foundation Models with Traditional Navigation

    Authors: Sonia Raychaudhuri, Duy Ta, Katrina Ashton, Angel X. Chang, Jiuguang Wang, Bernadette Bucher

    Abstract: Large scale scenes such as multifloor homes can be robustly and efficiently mapped with a 3D graph of landmarks estimated jointly with robot poses in a factor graph, a technique commonly used in commercial robots such as drones and robot vacuums. In this work, we propose Language-Inferred Factor Graph for Instruction Following (LIFGIF), a zero-shot method to ground natural language instructions in…

    Submitted 7 May, 2025; v1 submitted 12 November, 2024; originally announced November 2024.

  10. arXiv:2410.16499  [pdf, other]

    cs.CV

    SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects

    Authors: Jiayi Liu, Denys Iliash, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri

    Abstract: We address the challenge of creating 3D assets for household articulated objects from a single image. Prior work on articulated object creation either requires multi-view multi-state input, or only allows coarse control over the generation process. These limitations hinder the scalability and practicality for articulated object modeling. In this work, we propose a method to generate articulated ob…

    Submitted 19 March, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: Project page: https://3dlg-hcvc.github.io/singapo

  11. arXiv:2409.18896  [pdf, other]

    cs.CV

    S2O: Static to Openable Enhancement for Articulated 3D Objects

    Authors: Denys Iliash, Hanxiao Jiang, Yiming Zhang, Manolis Savva, Angel X. Chang

    Abstract: Despite much progress in large 3D datasets there are currently few interactive 3D object datasets, and their scale is limited due to the manual effort required in their construction. We introduce the static to openable (S2O) task which creates interactive articulated 3D objects from static counterparts through openable part detection, motion prediction, and interior geometry completion. We formula…

    Submitted 15 March, 2025; v1 submitted 27 September, 2024; originally announced September 2024.

  12. arXiv:2408.16059  [pdf, other]

    cond-mat.soft physics.app-ph

    Textile hinges enable extreme properties of mechanical metamaterials

    Authors: A. S. Meeussen, G. Bordiga, A. X. Chang, B. Spoettling, K. P. Becker, L. Mahadevan, K. Bertoldi

    Abstract: Mechanical metamaterials -- structures with unusual properties that emerge from their internal architecture -- that are designed to undergo large deformations typically exploit large internal rotations, and therefore, necessitate the incorporation of flexible hinges. In the mechanism limit, these metamaterials consist of rigid bodies connected by ideal hinges that deform at zero energy cost. Howev…

    Submitted 28 August, 2024; originally announced August 2024.

  13. arXiv:2408.03178  [pdf, other]

    cs.CV cs.GR cs.LG

    An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

    Authors: Xingguang Yan, Han-Hung Lee, Ziyu Wan, Angel X. Chang

    Abstract: We introduce a new approach for generating realistic 3D models with UV maps through a representation termed "Object Images." This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Project Page: https://omages.github.io/

  14. arXiv:2408.02211  [pdf, ps, other]

    cs.GR

    SceneMotifCoder: Example-driven Visual Program Learning for Generating 3D Object Arrangements

    Authors: Hou In Ivan Tam, Hou In Derek Pun, Austin T. Wang, Angel X. Chang, Manolis Savva

    Abstract: Despite advances in text-to-3D generation methods, generation of multi-object arrangements remains challenging. Current methods exhibit failures in generating physically plausible arrangements that respect the provided text description. We present SceneMotifCoder (SMC), an example-driven framework for generating 3D object arrangements through visual program learning. SMC leverages large language m…

    Submitted 3 June, 2025; v1 submitted 4 August, 2024; originally announced August 2024.

    Comments: Accepted at 3DV 2025 (Oral). Project page: https://3dlg-hcvc.github.io/smc/. Minor revisions for camera-ready version

  15. arXiv:2406.12723  [pdf, other]

    cs.LG cs.AI cs.CV q-bio.PE

    BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

    Authors: Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

    Abstract: As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establish several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by includin…

    Submitted 28 February, 2025; v1 submitted 18 June, 2024; originally announced June 2024.

    Journal ref: NeurIPS 2024

  16. arXiv:2406.11579  [pdf, other]

    cs.CV

    Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

    Authors: Han-Hung Lee, Yiming Zhang, Angel X. Chang

    Abstract: We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirement…

    Submitted 19 March, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: ICLR 2025

  17. arXiv:2405.17537  [pdf, other]

    cs.AI cs.CL cs.CV

    CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

    Authors: ZeMing Gong, Austin T. Wang, Xiaoliang Huo, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

    Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embe…

    Submitted 2 April, 2025; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: 31 pages with 14 figures

  18. arXiv:2405.10255  [pdf, other]

    cs.CV cs.RO

    When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

    Authors: Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

    Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context lear…

    Submitted 16 May, 2024; originally announced May 2024.

  19. arXiv:2405.05010  [pdf, other]

    cs.CV

    ${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields

    Authors: Ning Wang, Lefei Zhang, Angel X Chang

    Abstract: Neural fields (NeRF) have emerged as a promising approach for representing continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs poses a significant challenge for scene decomposition. To address this challenge, we present a single model, Multi-Modal Decomposition NeRF (${M^2D}$NeRF), that is capable of both text-based and visual patch-based edits. Specifically, we use multi-mo…

    Submitted 8 May, 2024; originally announced May 2024.

  20. arXiv:2403.13289  [pdf, other]

    cs.CV

    Text-to-3D Shape Generation

    Authors: Han-Hung Lee, Manolis Savva, Angel X. Chang

    Abstract: Recent years have seen an explosion of work and interest in text-to-3D shape generation. Much of the progress is driven by advances in 3D representations, large-scale pretraining and representation learning for text and image data enabling generative AI models, and differentiable rendering. Computational systems that can perform text-to-3D shape generation have captivated the popular imagination a…

    Submitted 20 March, 2024; originally announced March 2024.

  21. arXiv:2403.12301  [pdf, other]

    cs.CV

    R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding

    Authors: Qirui Wu, Sonia Raychaudhuri, Daniel Ritchie, Manolis Savva, Angel X Chang

    Abstract: We introduce the Reality-linked 3D Scenes (R3DS) dataset of synthetic 3D scenes mirroring the real-world scene arrangements from Matterport3D panoramas. Compared to prior work, R3DS has more complete and densely populated scenes with objects linked to real-world observations in panoramas. R3DS also provides an object support hierarchy, and matching object sets (e.g., same chairs around a dining ta…

    Submitted 18 March, 2024; originally announced March 2024.

  22. arXiv:2401.00405  [pdf, other]

    cs.CV

    Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects

    Authors: Qirui Wu, Daniel Ritchie, Manolis Savva, Angel X. Chang

    Abstract: Single-view 3D shape retrieval is a challenging task that is increasingly important with the growth of available 3D data. Prior work that has studied this task has not focused on evaluating how realistic occlusions impact performance, and how shape retrieval methods generalize to scenarios where either the target 3D shape database contains unseen shapes, or the input image contains unseen objects.…

    Submitted 31 December, 2023; originally announced January 2024.

  23. arXiv:2311.02401  [pdf, other]

    cs.LG

    BarcodeBERT: Transformers for Biodiversity Analysis

    Authors: Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, Angel X. Chang, Scott C. Lowe, Graham W. Taylor

    Abstract: In the global challenge of understanding and characterizing biodiversity, short species-specific genomic sequences known as DNA barcodes play a critical role, enabling fine-grained comparisons among organisms within the same kingdom of life. Although machine learning algorithms specifically designed for the analysis of DNA barcodes are becoming more popular, most existing methodologies rely on gen…

    Submitted 21 January, 2025; v1 submitted 4 November, 2023; originally announced November 2023.

    Comments: Main text: 14 pages, Total: 23 pages, 10 figures, formerly accepted at the 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023)

  24. arXiv:2309.05251  [pdf, other]

    cs.CV

    Multi3DRefer: Grounding Text Description to Multiple 3D Objects

    Authors: Yiming Zhang, ZeMing Gong, Angel X. Chang

    Abstract: We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and ob…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  25. arXiv:2307.10455  [pdf, other]

    cs.CV cs.AI cs.LG

    A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset

    Authors: Zahra Gharaee, ZeMing Gong, Nicholas Pellegrino, Iuliia Zarubiieva, Joakim Bruslund Haurum, Scott C. Lowe, Jaclyn T. A. McKeown, Chris C. Y. Ho, Joschka McLeod, Yi-Yun C Wei, Jireh Agda, Sujeevan Ratnasingham, Dirk Steinke, Angel X. Chang, Graham W. Taylor, Paul Fieguth

    Abstract: In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a c…

    Submitted 13 November, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

  26. arXiv:2306.11290  [pdf, other]

    cs.CV

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Authors: Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, Manolis Savva

    Abstract: We contribute the Habitat Synthetic Scene Dataset, a dataset of 211 high-quality 3D scenes, and use it to test navigation agent generalization to realistic 3D environments. Our dataset represents real interiors and contains a diverse set of 18,656 models of real-world objects. We investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find…

    Submitted 7 December, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

  27. arXiv:2305.18557  [pdf, other]

    cs.CV

    Evaluating 3D Shape Analysis Methods for Robustness to Rotation Invariance

    Authors: Supriya Gadi Patil, Angel X. Chang, Manolis Savva

    Abstract: This paper analyzes the robustness of recent 3D shape descriptors to SO(3) rotations, something that is fundamental to shape modeling. Specifically, we formulate the task of rotated 3D object instance detection. To do so, we consider a database of 3D indoor scenes, where objects occur in different orientations. We benchmark different methods for feature extraction and classification in the context…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 20th Conference on Robots and Vision (CRV) 2023

  28. arXiv:2304.03696  [pdf, other]

    cs.RO cs.CV

    MOPA: Modular Object Navigation with PointGoal Agents

    Authors: Sonia Raychaudhuri, Tommaso Campari, Unnat Jain, Manolis Savva, Angel X. Chang

    Abstract: We propose a simple but effective modular approach MOPA (Modular ObjectNav with PointGoal agents) to systematically investigate the inherent modularity of the object navigation task in Embodied AI. MOPA consists of four modules: (a) an object detection module trained to identify objects from RGB images, (b) a map building module to build a semantic map of the observed objects, (c) an exploration m…

    Submitted 27 January, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

  29. arXiv:2303.14087  [pdf, other]

    cs.CV

    OPDMulti: Openable Part Detection for Multiple Objects

    Authors: Xiaohao Sun, Hanxiao Jiang, Manolis Savva, Angel Xuan Chang

    Abstract: Openable part detection is the task of detecting the openable parts of an object in a single-view image, and predicting corresponding motion parameters. Prior work investigated the unrealistic setting where all input images only contain a single openable object. We generalize this task to scenes with multiple objects each potentially possessing openable parts, and create a corresponding dataset ba…

    Submitted 24 March, 2023; originally announced March 2023.

  30. arXiv:2212.00836  [pdf, other]

    cs.CV

    UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

    Authors: Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang

    Abstract: Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts on connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly depict their shared nature to learn them simultaneously. In this work, we propose UniT3D, a simple…

    Submitted 1 December, 2022; originally announced December 2022.

  31. arXiv:2212.00767  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Exploiting Proximity-Aware Tasks for Embodied Social Navigation

    Authors: Enrico Cancelli, Tommaso Campari, Luciano Serafini, Angel X. Chang, Lamberto Ballan

    Abstract: Learning how to navigate among humans in an occluded and spatially constrained indoor environment, is a key ability required to embodied agent to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Proximity-Aware Tasks (referred as to Risk and Proximity Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sen…

    Submitted 10 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

  32. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  33. arXiv:2210.05633  [pdf, other]

    cs.CV

    Habitat-Matterport 3D Semantics Dataset

    Authors: Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, Alexander William Clegg, Devendra Singh Chaplot

    Abstract: We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior…

    Submitted 12 October, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 15 Pages, 11 Figures, 6 Tables

  34. arXiv:2209.15172  [pdf, other]

    cs.CV cs.GR cs.LG

    Understanding Pure CLIP Guidance for Voxel Grid NeRF Models

    Authors: Han-Hung Lee, Angel X. Chang

    Abstract: We explore the task of text to 3D object generation using CLIP. Specifically, we use CLIP for guidance without access to any datasets, a setting we refer to as pure CLIP guidance. While prior work has adopted this setting, there is no systematic study of mechanics for preventing adversarial generations within CLIP. We illustrate how different image-based augmentations prevent the adversarial gener…

    Submitted 29 September, 2022; originally announced September 2022.

  35. arXiv:2209.05612  [pdf, other]

    cs.CV

    Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges

    Authors: Sanjay Haresh, Xiaohao Sun, Hanxiao Jiang, Angel X. Chang, Manolis Savva

    Abstract: Human-object interactions with articulated objects are common in everyday life. Despite much progress in single-view 3D reconstruction, it is still challenging to infer an articulated 3D object model from an RGB video showing a person manipulating the object. We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video, and carry out a systematic benchmark of f…

    Submitted 12 September, 2022; originally announced September 2022.

    Comments: 3DV 2022

  36. arXiv:2205.13675  [pdf, other]

    cs.AR cs.AI cs.LG

    Reinforcement Learning Approach for Mapping Applications to Dataflow-Based Coarse-Grained Reconfigurable Array

    Authors: Andre Xian Ming Chang, Parth Khopkar, Bashar Romanous, Abhishek Chaurasia, Patrick Estep, Skyler Windh, Doug Vanesko, Sheik Dawood Beer Mohideen, Eugenio Culurciello

    Abstract: The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the…

    Submitted 26 May, 2022; originally announced May 2022.

    Comments: 10 pages, 12 figures

  37. arXiv:2203.16421  [pdf, other]

    cs.CV

    OPD: Single-view 3D Openable Part Detection

    Authors: Hanxiao Jiang, Yongsen Mao, Manolis Savva, Angel X. Chang

    Abstract: We address the task of predicting what parts of an object can open and how they move when they do so. The input is a single image of an object, and as output we detect what parts of the object can open, and the motion parameters describing the articulation of each openable part. To tackle this task, we create two datasets of 3D objects: OPDSynth based on existing synthetic objects, and OPDReal bas…

    Submitted 30 March, 2022; originally announced March 2022.

  38. arXiv:2201.07366  [pdf, other]

    cs.CV

    TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

    Authors: Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X. Chang

    Abstract: Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data. Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at tasks such as retrieval and classification. Thus far, work on joint representation learning for 3D shapes and text has focused on improving embeddings through modeling of complex attention between r…

    Submitted 27 December, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: Accepted by WACV 2024

  39. Roominoes: Generating Novel 3D Floor Plans From Existing 3D Rooms

    Authors: Kai Wang, Xianghao Xu, Leon Lei, Selena Ling, Natalie Lindsay, Angel X. Chang, Manolis Savva, Daniel Ritchie

    Abstract: Realistic 3D indoor scene datasets have enabled significant recent progress in computer vision, scene understanding, autonomous navigation, and 3D reconstruction. But the scale, diversity, and customizability of existing datasets is limited, and it is time-consuming and expensive to scan and annotate more. Fortunately, combinatorics is on our side: there are enough individual rooms in existing 3D…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Symposium on Geometry Processing (SGP) 2021

    Journal ref: Computer Graphics Forum, 40: 57-69 (2021)

  40. arXiv:2112.01551  [pdf, other]

    cs.CV

    D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

    Authors: Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang

    Abstract: Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments is not fully studied yet. To address these challenges,…

    Submitted 22 July, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: Project website: https://daveredrum.github.io/D3Net/

  41. arXiv:2110.05769  [pdf, other]

    cs.CV cs.AI cs.LG cs.MA

    Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents

    Authors: Shivansh Patel, Saim Wani, Unnat Jain, Alexander Schwing, Svetlana Lazebnik, Manolis Savva, Angel X. Chang

    Abstract: Communication between embodied AI agents has received increasing attention in recent years. Despite its use, it is still unclear whether the learned communication is interpretable and grounded in perception. To study the grounding of emergent forms of communication, we first introduce the collaborative multi-object navigation task CoMON. In this task, an oracle agent has detailed environment infor…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Project page: https://shivanshpatel35.github.io/comon/ ; the first three authors contributed equally

  42. arXiv:2109.15207  [pdf, other]

    cs.CV cs.CL cs.RO

    Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

    Authors: Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang

    Abstract: In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is ofte…

    Submitted 30 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  43. arXiv:2109.08238  [pdf, other]

    cs.CV cs.AI

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Authors: Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, Dhruv Batra

    Abstract: We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in te…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 21 pages, 14 figures

  44. arXiv:2106.06629  [pdf, other]

    cs.CV

    Mirror3D: Depth Refinement for Mirror Surfaces

    Authors: Jiaqi Tan, Weijie Lin, Angel X. Chang, Manolis Savva

    Abstract: Despite recent progress in depth sensing and 3D reconstruction, mirror surfaces are a significant source of errors. To address this problem, we create the Mirror3D dataset: a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2 and ScanNet) containing 7,011 mirror instance masks and 3D planes. We then develop Mirror3DNet: a module that refines raw sensor depth or estimated dep…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: Paper presented at CVPR 2021. For code, data and pretrained models, see https://3dlg-hcvc.github.io/mirror3d/

  45. arXiv:2106.05375  [pdf, other]

    cs.CV cs.GR

    Plan2Scene: Converting Floorplans to 3D Scenes

    Authors: Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X. Chang, Manolis Savva

    Abstract: We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene. Our system 1) lifts a floorplan image to a 3D mesh model; 2) synthesizes surface textures based on the input photos; and 3) infers textures for unobserved surfaces using a graph neural network architecture. To train and evaluate our system we c…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: This paper is accepted to CVPR 2021. For code, data and pretrained models, see https://3dlg-hcvc.github.io/plan2scene/

  46. arXiv:2012.03912  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation

    Authors: Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang, Manolis Savva

    Abstract: Navigation tasks in photorealistic 3D environments are challenging because they require perception and effective planning under partial observability. Recent work shows that map-like memory is useful for long-horizon navigation tasks. However, a focused investigation of the impact of maps on navigation tasks of varying complexity has not yet been performed. We propose the multiON task, which requi…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: Project page: https://shivanshpatel35.github.io/multi-ON/ ; the first three authors contributed equally

  47. arXiv:2012.02206  [pdf, other]

    cs.CV cs.LG eess.IV

    Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

    Authors: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

    Abstract: We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in…

    Submitted 3 December, 2020; originally announced December 2020.

    Comments: Video: https://youtu.be/AgmIpDbwTCY

  48. arXiv:2011.01975  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    Rearrangement: A Challenge for Embodied AI

    Authors: Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, Hao Su

    Abstract: We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specifie…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Authors are listed in alphabetical order

  49. arXiv:2003.08515  [pdf, other]

    cs.CV cs.RO

    SAPIEN: A SimulAted Part-based Interactive ENvironment

    Authors: Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, Hao Su

    Abstract: Building home assistant robots has long been a pursuit for vision and robotics researchers. To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable. Existing environments achieve these requirements for robotics simulation with different levels of simplification and focus. We take one…

    Submitted 18 March, 2020; originally announced March 2020.

  50. arXiv:1912.08830  [pdf, other]

    cs.CV cs.CL cs.LG eess.IV

    ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

    Authors: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner

    Abstract: We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language express…

    Submitted 11 November, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: Project page: https://daveredrum.github.io/ScanRefer/