
Showing 1–44 of 44 results for author: Kosecka, J

  1. arXiv:2506.03516

    cs.RO cs.AI

    SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models

    Authors: Arnab Debnath, Gregory J. Stein, Jana Kosecka

    Abstract: Object goal navigation is a fundamental task in embodied AI, where an agent is instructed to locate a target object in an unexplored environment. Traditional learning-based methods rely heavily on large-scale annotated data or require extensive interaction with the environment in a reinforcement learning setting, often failing to generalize to novel environments and limiting scalability. To overco…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted at CVPR 2025 workshop - Foundation Models Meet Embodied Agents

  2. arXiv:2506.02593

    cs.RO

    A Hybrid Approach to Indoor Social Navigation: Integrating Reactive Local Planning and Proactive Global Planning

    Authors: Arnab Debnath, Gregory J. Stein, Jana Kosecka

    Abstract: We consider the problem of indoor building-scale social navigation, where the robot must reach a point goal as quickly as possible without colliding with humans who are freely moving around. Factors such as varying crowd densities, unpredictable human behavior, and the constraints of indoor spaces add significant complexity to the navigation task, necessitating a more advanced approach. We propose…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted at ICRA 2025

  3. arXiv:2505.02278

    cs.CV

    Compositional Image-Text Matching and Retrieval by Grounding Entities

    Authors: Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká

    Abstract: Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform enti…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: Accepted at CVPR-W

  4. arXiv:2502.07306

    cs.CV cs.AI cs.CL cs.LG cs.RO

    TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation

    Authors: Navid Rajabi, Jana Kosecka

    Abstract: In this work, we propose a modular approach for the Vision-Language Navigation (VLN) task by decomposing the problem into four sub-modules that use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting. Given a navigation instruction in natural language, we first prompt an LLM to extract the landmarks and the order in which they are visited. Assuming the…

    Submitted 9 June, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: Accepted to CVPR 2025 Workshop - Foundation Models Meet Embodied Agents

  5. arXiv:2410.07394

    cs.CV

    Structured Spatial Reasoning with Open Vocabulary Object Detectors

    Authors: Negar Nejatishahidin, Madhukar Reddy Vongala, Jana Kosecka

    Abstract: Reasoning about spatial relationships between objects is essential for many real-world robotic tasks, such as fetch-and-delivery, object rearrangement, and object search. The ability to detect and disambiguate different objects and identify their location is key to successful completion of these tasks. Several recent works have used powerful Vision and Language Models (VLMs) to unlock this capabil…

    Submitted 9 October, 2024; originally announced October 2024.

  6. arXiv:2407.01394

    cs.CV cs.CL cs.LG

    Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

    Authors: Pooya Fayyazsanavi, Antonios Anastasopoulos, Jana Košecká

    Abstract: Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-…

    Submitted 12 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  7. arXiv:2406.13246

    cs.CL cs.CV cs.LG

    GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

    Authors: Navid Rajabi, Jana Kosecka

    Abstract: The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning. This skill rests on the ability to recognize and localize objects of interest and determine their spatial relation. Early vision and language models (VLMs) have been shown to struggle to recognize spatial relations. We extend the previously released What'sUp dat…

    Submitted 10 October, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted to NeurIPS 2024 Workshop on Compositional Learning

  8. arXiv:2404.19128

    cs.CV cs.CL cs.LG

    Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM

    Authors: Navid Rajabi, Jana Kosecka

    Abstract: Vision and Language Models (VLMs) continue to demonstrate remarkable zero-shot (ZS) performance across various tasks. However, many probing studies have revealed that even the best-performing VLMs struggle to capture aspects of compositional scene understanding, lacking the ability to properly ground and localize linguistic phrases in images. Recent VLM advancements include scaling up both model a…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024, Second Workshop on Foundation Models (WFM)

  9. arXiv:2401.16575

    cs.CL cs.CV

    Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking

    Authors: Ivana Beňová, Jana Košecká, Michal Gregor, Martin Tamajka, Marcel Veselý, Marián Šimko

    Abstract: The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called gui…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: 9 pages of text, 11 pages total, 7 figures, 3 tables, preprint

  10. arXiv:2311.12128

    cs.CV cs.HC

    Fingerspelling PoseNet: Enhancing Fingerspelling Translation with Pose-Based Transformer Models

    Authors: Pooya Fayyazsanavi, Negar Nejatishahidin, Jana Kosecka

    Abstract: We address the task of American Sign Language fingerspelling translation using videos in the wild. We exploit advances in more accurate hand pose estimation and propose a novel architecture that leverages a transformer-based encoder-decoder model, enabling seamless contextual word translation. The translation model is augmented by a novel loss term that accurately predicts the length of the finge…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: WACV 2024

  11. arXiv:2311.10883

    cs.CV cs.CL cs.RO

    Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models

    Authors: Yimeng Li, Navid Rajabi, Sulabh Shrestha, Md Alimoor Reza, Jana Kosecka

    Abstract: The image annotation stage is a critical and often the most time-consuming part required for training and evaluating object detection and semantic segmentation models. Deployment of the existing models in novel environments often requires detecting novel semantic classes not present in the training data. Furthermore, indoor scenes contain significant viewpoint variations, which need to be handled…

    Submitted 17 November, 2023; originally announced November 2023.

  12. arXiv:2308.09778

    cs.CV cs.CL cs.LG

    Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

    Authors: Navid Rajabi, Jana Kosecka

    Abstract: Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed that these models lack fine-grained understanding, such as the ability to count and recognize verbs, attributes, or relationships. The focus of this work is to s…

    Submitted 5 March, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted to DMLR @ ICLR 2024

  13. arXiv:2304.13201

    cs.CV

    Graph-CoVis: GNN-based Multi-view Panorama Global Pose Estimation

    Authors: Negar Nejatishahidin, Will Hutchcroft, Manjunath Narayana, Ivaylo Boyadzhiev, Yuguang Li, Naji Khosravan, Jana Kosecka, Sing Bing Kang

    Abstract: In this paper, we address the problem of wide-baseline camera pose estimation from a group of 360° panoramas under upright-camera assumption. Recent work has demonstrated the merit of deep-learning for end-to-end direct relative pose regression in 360° panorama pairs [11]. To exploit the benefits of multi-view logic in a learning-based framework, we introduce Graph-CoVis, which non-t…

    Submitted 25 April, 2023; originally announced April 2023.

  14. arXiv:2304.08580

    cs.CV cs.RO

    U2RLE: Uncertainty-Guided 2-Stage Room Layout Estimation

    Authors: Pooya Fayyazsanavi, Zhiqiang Wan, Will Hutchcroft, Ivaylo Boyadzhiev, Yuguang Li, Jana Kosecka, Sing Bing Kang

    Abstract: While the existing deep learning-based room layout estimation techniques demonstrate good overall accuracy, they are less effective for distant floor-wall boundaries. To tackle this problem, we propose a novel uncertainty-guided approach for layout boundary estimation, introducing a new two-stage CNN architecture termed U2RLE. The initial stage predicts both the floor-wall boundary and its uncertainty and…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: To appear at CVPR 2023

  15. arXiv:2212.08801

    cs.RO cs.CV

    Comparison of Model-Free and Model-Based Learning-Informed Planning for PointGoal Navigation

    Authors: Yimeng Li, Arnab Debnath, Gregory J. Stein, Jana Kosecka

    Abstract: In recent years, several learning approaches to point goal navigation in previously unseen environments have been proposed. They vary in the representations of the environments, problem decomposition, and experimental evaluation. In this work, we compare the state-of-the-art Deep Reinforcement Learning based approaches with a Partially Observable Markov Decision Process (POMDP) formulation of the poi…

    Submitted 17 December, 2022; originally announced December 2022.

    Comments: arXiv admin note: text overlap with arXiv:2211.07898

  16. arXiv:2211.07898

    cs.RO cs.CV

    Learning-Augmented Model-Based Planning for Visual Exploration

    Authors: Yimeng Li, Arnab Debnath, Gregory Stein, Jana Kosecka

    Abstract: We consider the problem of time-limited robotic exploration in previously unseen environments where exploration is limited by a predefined amount of time. We propose a novel exploration approach using learning-augmented model-based planning. We generate a set of subgoals associated with frontiers on the current map and derive a Bellman Equation for exploration with these subgoals. Visual sensing a…

    Submitted 9 August, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

    Comments: Accepted to IROS 2023

  17. arXiv:2210.01884

    cs.CV

    Self-supervised Pre-training for Semantic Segmentation in an Indoor Scene

    Authors: Sulabh Shrestha, Yimeng Li, Jana Kosecka

    Abstract: The ability to endow maps of indoor scenes with semantic information is an integral part of robotic agents which perform different tasks such as target driven navigation, object search or object rearrangement. The state-of-the-art methods use Deep Convolutional Neural Networks (DCNNs) for predicting semantic segmentation of an image as a useful representation for these tasks. The accuracy of semanti…

    Submitted 4 October, 2022; originally announced October 2022.

  18. arXiv:2209.13137

    cs.RO

    Using Unmanned Aerial Systems (UAS) for Assessing and Monitoring Fall Hazard Prevention Systems in High-rise Building Projects

    Authors: Yimeng Li, Behzad Esmaeili, Masoud Gheisari, Jana Kosecka, Abbas Rashidi

    Abstract: This study develops a framework for unmanned aerial systems (UASs) to monitor fall hazard prevention systems near unprotected edges and openings in high-rise building projects. A three-step machine-learning-based framework was developed and tested to detect guardrail posts from the images captured by UAS. First, a guardrail detector was trained to localize the candidate locations of posts supporti…

    Submitted 26 September, 2022; originally announced September 2022.

  19. arXiv:2203.01449

    cs.CV cs.RO

    Object Pose Estimation using Mid-level Visual Representations

    Authors: Negar Nejatishahidin, Pooya Fayyazsanavi, Jana Kosecka

    Abstract: This work proposes a novel pose estimation model for object categories that can be effectively transferred to previously unseen environments. The deep convolutional network models (CNN) for pose estimation are typically trained and evaluated on datasets specifically curated for object detection, pose estimation, or 3D reconstruction, which requires large amounts of training data. In this work, we…

    Submitted 2 March, 2022; originally announced March 2022.

  20. arXiv:2111.12866

    cs.CV

    Uncertainty Aware Proposal Segmentation for Unknown Object Detection

    Authors: Yimeng Li, Jana Kosecka

    Abstract: Recent efforts in deploying Deep Neural Networks for object detection in real world applications, such as autonomous driving, assume that all relevant object classes have been observed during training. Quantifying the performance of these models in settings when the test data is not represented in the training set has mostly focused on pixel-level uncertainty estimation techniques of models traine…

    Submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted to WACV 2022 DNOW Workshop

  21. arXiv:2109.08218

    cs.LG

    SLAW: Scaled Loss Approximate Weighting for Efficient Multi-Task Learning

    Authors: Michael Crawshaw, Jana Košecká

    Abstract: Multi-task learning (MTL) is a subfield of machine learning with important applications, but the multi-objective nature of optimization in MTL leads to difficulties in balancing training between tasks. The best MTL optimization methods require individually computing the gradient of each task's loss function, which impedes scalability to a large number of tasks. In this paper, we propose Scaled Los…

    Submitted 16 September, 2021; originally announced September 2021.

  22. arXiv:2006.15127

    eess.IV cs.LG

    Diverse Knowledge Distillation (DKD): A Solution for Improving The Robustness of Ensemble Models Against Adversarial Attacks

    Authors: Ali Mirzaeian, Jana Kosecka, Houman Homayoun, Tinoosh Mohsenin, Avesta Sasan

    Abstract: This paper proposes an ensemble learning model that is resistant to adversarial attacks. To build resilience, we introduced a training process where each member learns a radically distinct latent space. Member models are added one at a time to the ensemble. Simultaneously, the loss function is regulated by a reverse knowledge distillation, forcing the new member to learn different features and map…

    Submitted 7 January, 2021; v1 submitted 26 June, 2020; originally announced June 2020.

  23. arXiv:2003.08753

    cs.CV cs.HC cs.LG stat.ML

    FineHand: Learning Hand Shapes for American Sign Language Recognition

    Authors: Al Amin Hosain, Panneer Selvam Santhalingam, Parth Pathak, Huzefa Rangwala, Jana Kosecka

    Abstract: American Sign Language recognition is a difficult gesture recognition problem, characterized by fast, highly articulate gestures. These are comprised of arm movements with different hand shapes, facial expression and head movements. Among these components, hand shape is a vital, and often the most discriminative, part of a gesture. In this work, we present an approach for effective learning of hand s…

    Submitted 4 March, 2020; originally announced March 2020.

  24. arXiv:2003.08743

    cs.CV

    Generative Multi-Stream Architecture For American Sign Language Recognition

    Authors: Dom Huh, Sai Gurrapu, Frederick Olson, Huzefa Rangwala, Parth Pathak, Jana Kosecka

    Abstract: With advancements in deep model architectures, tasks in computer vision can reach optimal convergence provided proper data preprocessing and model parameter initialization. However, training on datasets with low feature-richness for complex applications limits and impairs optimal convergence, leaving performance below human level. In past works, researchers have provided external sources of complementary data a…

    Submitted 9 March, 2020; originally announced March 2020.

  25. arXiv:2003.04232

    cs.CV cs.LG cs.RO

    Hierarchical Kinematic Human Mesh Recovery

    Authors: Georgios Georgakis, Ren Li, Srikrishna Karanam, Terrence Chen, Jana Kosecka, Ziyan Wu

    Abstract: We consider the problem of estimating a parametric model of 3D human mesh from a single image. While there has been substantial recent progress in this area with direct regression of model parameters, these methods only implicitly exploit the human body kinematic structure, leading to sub-optimal use of the model prior. In this work, we address this gap by proposing a new technique for regression…

    Submitted 14 July, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

    Comments: 17 pages, 8 figures, 5 tables, ECCV 2020

  26. arXiv:2003.02327

    cs.CV cs.RO

    Learning View and Target Invariant Visual Servoing for Navigation

    Authors: Yimeng Li, Jana Kosecka

    Abstract: The advances in deep reinforcement learning recently revived interest in data-driven, learning-based approaches to navigation. In this paper we propose to learn viewpoint invariant and target invariant visual servoing for local mobile robot navigation; given an initial view and the goal view or an image of a target, we train a deep convolutional network controller to reach the desired goal. We presen…

    Submitted 4 March, 2020; originally announced March 2020.

    Comments: Accepted to ICRA 2020

  27. arXiv:1911.07980

    cs.CV cs.RO

    Simultaneous Mapping and Target Driven Navigation

    Authors: Georgios Georgakis, Yimeng Li, Jana Kosecka

    Abstract: This work presents a modular architecture for simultaneous mapping and target driven navigation in indoor environments. The semantic and appearance information stored in a 2.5D map is distilled from RGB images, semantic segmentation and outputs of object detectors by convolutional neural networks. Given this representation, the mapping module learns to localize the agent and register consecutive observations i…

    Submitted 18 November, 2019; originally announced November 2019.

  28. arXiv:1909.11232

    cs.LG cs.CL cs.CV stat.ML

    Sign Language Recognition Analysis using Multimodal Data

    Authors: Al Amin Hosain, Panneer Selvam Santhalingam, Parth Pathak, Jana Kosecka, Huzefa Rangwala

    Abstract: Voice-controlled personal and home assistants (such as the Amazon Echo and Apple Siri) are becoming increasingly popular for a variety of applications. However, the benefits of these technologies are not readily accessible to Deaf or Hard-of-Hearing (DHH) users. The objective of this study is to develop and evaluate a sign recognition system using multiple modalities that can be used by DHH signers…

    Submitted 24 September, 2019; originally announced September 2019.

    Comments: Conference: IEEE DSAA 2019, Washington DC

  29. arXiv:1811.07249

    cs.CV cs.LG cs.RO

    Learning Local RGB-to-CAD Correspondences for Object Pose Estimation

    Authors: Georgios Georgakis, Srikrishna Karanam, Ziyan Wu, Jana Kosecka

    Abstract: We consider the problem of 3D object pose estimation. While much recent work has focused on the RGB domain, the reliance on accurately annotated images limits their generalizability and scalability. On the other hand, the easily available CAD models of objects are rich sources of data, providing a large number of synthetically rendered images. In this paper, we solve this key problem of existing m…

    Submitted 31 July, 2019; v1 submitted 17 November, 2018; originally announced November 2018.

    Comments: 10 pages, 6 figures, 4 tables, ICCV 2019

  30. arXiv:1807.06757

    cs.AI cs.CV cs.LG cs.RO

    On Evaluation of Embodied Navigation Agents

    Authors: Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir

    Abstract: Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study emp…

    Submitted 17 July, 2018; originally announced July 2018.

    Comments: Report of a working group on empirical methodology in navigation research. Authors are listed in alphabetical order

  31. arXiv:1806.03370

    cs.CV

    Self-supervisory Signals for Object Discovery and Detection

    Authors: Etienne Pot, Alexander Toshev, Jana Kosecka

    Abstract: In robotic applications, we often face the challenge of discovering new objects while having very little or no labelled training data. In this paper we explore the use of self-supervision provided by a robot traversing an environment to learn representations of encountered objects. Knowledge of ego-motion and depth perception enables the agent to effectively associate multiple object proposals, wh…

    Submitted 8 June, 2018; originally announced June 2018.

  32. arXiv:1805.06066

    cs.CV

    Visual Representations for Semantic Target Driven Navigation

    Authors: Arsalan Mousavian, Alexander Toshev, Marek Fiser, Jana Kosecka, Ayzaan Wahid, James Davidson

    Abstract: What is a good visual representation for autonomous agents? We address this question in the context of semantic visual navigation, which is the problem of a robot finding its way through a complex environment to a target object, e.g. go to the refrigerator. Instead of acquiring a metric semantic map of an environment and using planning for navigation, our approach learns navigation policies on top…

    Submitted 2 July, 2019; v1 submitted 15 May, 2018; originally announced May 2018.

    Comments: Accepted to ICRA 2019 and ECCV 2018 Workshop on Visual Learning and Embodied Agents in Simulation Environments

  33. arXiv:1803.04610

    cs.CV

    Target Driven Instance Detection

    Authors: Phil Ammirato, Cheng-Yang Fu, Mykhailo Shvets, Jana Kosecka, Alexander C. Berg

    Abstract: While state-of-the-art general object detectors are getting better and better, there are not many systems specifically designed to take advantage of the instance detection problem. For many applications, such as household robotics, a system may need to recognize a few very specific instances at a time. Speed can be critical in these applications, as can the need to recognize previously unseen inst…

    Submitted 1 October, 2019; v1 submitted 12 March, 2018; originally announced March 2018.

  34. arXiv:1802.07869

    cs.CV

    End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching

    Authors: Georgios Georgakis, Srikrishna Karanam, Ziyan Wu, Jan Ernst, Jana Kosecka

    Abstract: Finding correspondences between images or 3D scans is at the heart of many computer vision and image retrieval applications and is often enabled by matching local keypoint descriptors. Various learning approaches have been applied in the past to different stages of the matching pipeline, considering detector, descriptor, or metric learning objectives. These objectives were typically addressed sepa…

    Submitted 9 May, 2018; v1 submitted 21 February, 2018; originally announced February 2018.

    Comments: 9 pages, 9 figures, 3 tables, CVPR 2018

  35. arXiv:1708.00514

    cs.CV

    Dense Piecewise Planar RGB-D SLAM for Indoor Environments

    Authors: Phi-Hung Le, Jana Kosecka

    Abstract: The paper exploits weak Manhattan constraints to parse the structure of indoor environments from RGB-D video sequences in an online setting. We extend the previous approach for single view parsing of indoor scenes to video sequences and formulate the problem of recovering the floor plan of the environment as an optimal labeling problem solved using dynamic programming. The temporal continuity is e…

    Submitted 1 August, 2017; originally announced August 2017.

    Comments: International Conference on Intelligent Robots and Systems (IROS) 2017

  36. arXiv:1702.08272

    cs.CV

    A Dataset for Developing and Benchmarking Active Vision

    Authors: Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, Alexander C. Berg

    Abstract: We present a new public dataset with a focus on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely captured in 9 unique scenes. We train a fast object category detector for instance detection on our data. Using the dataset we show that, although increasingly accurate…

    Submitted 3 March, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

    Comments: To appear at ICRA 2017

  37. arXiv:1702.07836

    cs.CV cs.RO

    Synthesizing Training Data for Object Detection in Indoor Scenes

    Authors: Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, Jana Kosecka

    Abstract: Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNN) to simultaneously detect and categorize the objects of interest in cluttered scenes. Training of such models typically requires large amounts of annotated training dat…

    Submitted 7 September, 2017; v1 submitted 25 February, 2017; originally announced February 2017.

    Comments: Added more experiments and link to project webpage

  38. arXiv:1612.00496

    cs.CV

    3D Bounding Box Estimation Using Deep Learning and Geometry

    Authors: Arsalan Mousavian, Dragomir Anguelov, John Flynn, Jana Kosecka

    Abstract: We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D…

    Submitted 10 April, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

    Comments: To appear in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017

  39. arXiv:1609.07826

    cs.CV cs.RO

    Multiview RGB-D Dataset for Object Instance Detection

    Authors: Georgios Georgakis, Md Alimoor Reza, Arsalan Mousavian, Phi-Hung Le, Jana Kosecka

    Abstract: This paper presents a new multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud. Also, an approach for detection and recognition is presented, whi…

    Submitted 25 September, 2016; originally announced September 2016.

  40. arXiv:1609.05590

    cs.CV

    Fast Single Shot Detection and Pose Estimation

    Authors: Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, Jana Kosecka, Alexander C. Berg

    Abstract: For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for sliding-window detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stag…

    Submitted 18 September, 2016; originally announced September 2016.

  41. arXiv:1609.00278

    cs.CV cs.RO

    Semantic Image Based Geolocation Given a Map

    Authors: Arsalan Mousavian, Jana Kosecka

    Abstract: Visual place recognition is a commonly used strategy for localization. Most successful appearance-based methods typically rely on a large database of views endowed with local or global image descriptors and strive to retrieve the views of the same location. The quality of the results is often affected by the density of the reference views and the robustness of the image representation wi…

    Submitted 1 September, 2016; originally announced September 2016.

  42. arXiv:1606.01178

    cs.CV cs.RO

    Reinforcement Learning for Semantic Segmentation in Indoor Scenes

    Authors: Md. Alimoor Reza, Jana Kosecka

    Abstract: Future advancements in robot autonomy and sophistication of robotics tasks rest on robust, efficient, and task-dependent semantic understanding of the environment. Semantic segmentation is the problem of simultaneous segmentation and categorization of a partition of sensory data. The majority of current approaches tackle this using multi-class segmentation and labeling in a Conditional Random Fiel…

    Submitted 3 June, 2016; originally announced June 2016.

  43. arXiv:1604.07480

    cs.CV

    Joint Semantic Segmentation and Depth Estimation with Deep Convolutional Networks

    Authors: Arsalan Mousavian, Hamed Pirsiavash, Jana Kosecka

    Abstract: Multi-scale deep CNNs have been used successfully for problems mapping each pixel to a label, such as depth estimation and semantic segmentation. It has also been shown that such architectures are reusable and can be used for multiple tasks. These networks are typically trained independently for each task by varying the output layer(s) and training objective. In this work we present a new model fo…

    Submitted 19 September, 2016; v1 submitted 25 April, 2016; originally announced April 2016.

  44. arXiv:1509.06033

    cs.CV

    Deep Convolutional Features for Image Based Retrieval and Scene Categorization

    Authors: Arsalan Mousavian, Jana Kosecka

    Abstract: Several recent approaches showed how the representations learned by Convolutional Neural Networks can be repurposed for novel tasks. Most commonly it has been shown that the activation features of the last fully connected layers (fc7 or fc6) of the network, followed by a linear classifier, outperform the state-of-the-art on several recognition challenge datasets. Instead of recognition, this paper…

    Submitted 20 September, 2015; originally announced September 2015.