-
Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs
Authors:
Faraz Lotfi,
Farnoosh Faraji,
Nikhil Kakodkar,
Travis Manderson,
David Meger,
Gregory Dudek
Abstract:
This paper explores leveraging large language models for map-free off-road navigation using generative AI, reducing the need for traditional data collection and annotation. We propose a method where a robot receives verbal instructions, converted to text through Whisper, and a large language model (LLM) model extracts landmarks, preferred terrains, and crucial adverbs translated into speed setting…
▽ More
This paper explores leveraging large language models for map-free off-road navigation using generative AI, reducing the need for traditional data collection and annotation. We propose a method where a robot receives verbal instructions, converted to text through Whisper, and a large language model (LLM) model extracts landmarks, preferred terrains, and crucial adverbs translated into speed settings for constrained navigation. A language-driven semantic segmentation model generates text-based masks for identifying landmarks and terrain types in images. By translating 2D image points to the vehicle's motion plane using camera parameters, an MPC controller can guides the vehicle towards the desired terrain. This approach enhances adaptation to diverse environments and facilitates the use of high-level instructions for navigating complex and challenging terrains.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots
Authors:
Dmitriy Rivkin,
Nikhil Kakodkar,
Francois Hogan,
Bobak H. Baghi,
Gregory Dudek
Abstract:
This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that are more akin to natural conversation than traditional explicit procedural directives typically seen in robotics. Unlike most prior work where navigation directives are provided as…
▽ More
This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that are more akin to natural conversation than traditional explicit procedural directives typically seen in robotics. Unlike most prior work where navigation directives are provided as simple imperative commands (e.g., "go to the fridge"), we examine implicit directives obtained through conversational interactions.We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot using our method CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots) can parse descriptive language queries up to 42% more reliably than existing LLM-enabled methods by exploiting the ability of LLMs to interpret the user interaction in the context of the objects in the scenario.
△ Less
Submitted 1 February, 2024; v1 submitted 21 July, 2023;
originally announced July 2023.
-
ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence
Authors:
Dmitriy Rivkin,
Gregory Dudek,
Nikhil Kakodkar,
David Meger,
Oliver Limoyo,
Xue Liu,
Francois Hogan
Abstract:
Our work examines the way in which large language models can be used for robotic planning and sampling, specifically the context of automated photographic documentation. Specifically, we illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general purpose language (LM) and vision-language (VLM) models. Given a high-level de…
▽ More
Our work examines the way in which large language models can be used for robotic planning and sampling, specifically the context of automated photographic documentation. Specifically, we illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general purpose language (LM) and vision-language (VLM) models. Given a high-level description of an event we use an LM to generate a natural-language list of photo descriptions that one would expect a photographer to capture at the event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. The photo portfolios generated by our method are consistently rated as more appropriate to the event by human evaluators than those generated by existing methods.
△ Less
Submitted 15 February, 2023;
originally announced February 2023.