-
DREAM: Domain-aware Reasoning for Efficient Autonomous Underwater Monitoring
Authors:
Zhenqi Wu,
Abhinav Modi,
Angelos Mavrogiannis,
Kaustubh Joshi,
Nikhil Chopra,
Yiannis Aloimonos,
Nare Karapetyan,
Ioannis Rekleitis,
Xiaomin Lin
Abstract:
The ocean is warming and acidifying, increasing the risk of mass mortality events for temperature-sensitive shellfish such as oysters. This motivates the development of long-term monitoring systems. However, human labor is costly and long-duration underwater work is highly hazardous, thus favoring robotic solutions as a safer and more efficient option. To enable underwater robots to make real-time, environment-aware decisions without human intervention, we must equip them with an intelligent "brain." This highlights the need for persistent, wide-area, and low-cost benthic monitoring. To this end, we present DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. The results show that our framework is highly efficient in finding and exploring target objects (e.g., oysters, shipwrecks) without prior location information. In the oyster-monitoring task, our framework takes 31.5% less time than the previous baseline for the same number of oysters. Compared to the vanilla VLM, it uses 23% fewer steps while covering 8.88% more oysters. In shipwreck scenes, our framework successfully explores and maps the wreck without collisions, requiring 27.5% fewer steps than the vanilla model and achieving 100% coverage, while the vanilla model achieves 60.23% average coverage in our shipwreck environments.
Submitted 16 September, 2025;
originally announced September 2025.
-
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
Authors:
Angelos Mavrogiannis,
Dehao Yuan,
Yiannis Aloimonos
Abstract:
There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.
Submitted 6 March, 2025; v1 submitted 23 September, 2024;
originally announced September 2024.
-
Cook2LTL: Translating Cooking Recipes to LTL Formulae using Large Language Models
Authors:
Angelos Mavrogiannis,
Christoforos Mavrogiannis,
Yiannis Aloimonos
Abstract:
Cooking recipes are challenging to translate to robot plans as they feature rich linguistic complexity, temporally-extended interconnected tasks, and an almost infinite space of possible actions. Our key insight is that combining a source of cooking domain knowledge with a formalism that captures the temporal richness of cooking recipes could enable the extraction of unambiguous, robot-executable plans. In this work, we use Linear Temporal Logic (LTL) as a formal language expressive enough to model the temporal nature of cooking recipes. Leveraging a pretrained Large Language Model (LLM), we present Cook2LTL, a system that translates instruction steps from an arbitrary cooking recipe found on the internet to a set of LTL formulae, grounding high-level cooking actions to a set of primitive actions that are executable by a manipulator in a kitchen environment. Cook2LTL makes use of a caching scheme that dynamically builds a queryable action library at runtime. We instantiate Cook2LTL in a realistic simulation environment (AI2-THOR), and evaluate its performance across a series of cooking recipes. We demonstrate that our system significantly decreases LLM API calls (-51%), latency (-59%), and cost (-42%) compared to a baseline that queries the LLM for every newly encountered action at runtime.
Submitted 7 March, 2024; v1 submitted 29 September, 2023;
originally announced October 2023.
-
B-GAP: Behavior-Rich Simulation and Navigation for Autonomous Driving
Authors:
Angelos Mavrogiannis,
Rohan Chandra,
Dinesh Manocha
Abstract:
We address the problem of ego-vehicle navigation in dense simulated traffic environments populated by road agents with varying driver behaviors. Navigation in such environments is challenging due to unpredictability in agents' actions caused by their heterogeneous behaviors. We present a new simulation technique consisting of enriching existing traffic simulators with behavior-rich trajectories corresponding to varying levels of aggressiveness. We generate these trajectories with the help of a driver behavior modeling algorithm. We then use the enriched simulator to train a deep reinforcement learning (DRL) policy that consists of a set of high-level vehicle control commands and use this policy at test time to perform local navigation in dense traffic. Our policy implicitly models the interactions between traffic agents and computes safe trajectories for the ego-vehicle accounting for aggressive driver maneuvers such as overtaking, over-speeding, weaving, and sudden lane changes. Our enhanced behavior-rich simulator can be used for generating datasets that consist of trajectories corresponding to diverse driver behaviors and traffic densities, and our behavior-based navigation scheme can be combined with state-of-the-art navigation algorithms.
Submitted 13 February, 2022; v1 submitted 7 November, 2020;
originally announced November 2020.