Search | arXiv e-print repository

IK Seed Generator for Dual-Arm Human-like Physicality Robot with Mobile Base

Authors: Jun Takamatsu, Atsushi Kanehira, Kazuhiro Sasabuchi, Naoki Wake, Katsushi Ikeuchi

Abstract: Robots are strongly expected as a means of replacing human tasks. If a robot has a human-like physicality, the possibility of replacing human tasks increases. In the case of household service robots, it is desirable for them to be on a human-like size so that they do not become excessively large in order to coexist with humans in their operating environment. However, robots with size limitations t… ▽ More Robots are strongly expected as a means of replacing human tasks. If a robot has a human-like physicality, the possibility of replacing human tasks increases. In the case of household service robots, it is desirable for them to be on a human-like size so that they do not become excessively large in order to coexist with humans in their operating environment. However, robots with size limitations tend to have difficulty solving inverse kinematics (IK) due to mechanical limitations, such as joint angle limitations. Conversely, if the difficulty coming from this limitation could be mitigated, one can expect that the use of such robots becomes more valuable. In numerical IK solver, which is commonly used for robots with higher degrees-of-freedom (DOF), the solvability of IK depends on the initial guess given to the solver. Thus, this paper proposes a method for generating a good initial guess for a numerical IK solver given the target hand configuration. For the purpose, we define the goodness of an initial guess using the scaled Jacobian matrix, which can calculate the manipulability index considering the joint limits. These two factors are related to the difficulty of solving IK. We generate the initial guess by optimizing the goodness using the genetic algorithm (GA). To enumerate much possible IK solutions, we use the reachability map that represents the reachable area of the robot hand in the arm-base coordinate system. We conduct quantitative evaluation and prove that using an initial guess that is judged to be better using the goodness value increases the probability that IK is solved. Finally, as an application of the proposed method, we show that by generating good initial guesses for IK a robot actually achieves three typical scenarios. △ Less

Submitted 1 May, 2025; originally announced May 2025.

Comments: 8 pages, 12 figures, 4 tables

arXiv:2504.18084 [pdf, other]

RL-Driven Data Generation for Robust Vision-Based Dexterous Grasping

Authors: Atsushi Kanehira, Naoki Wake, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This work presents reinforcement learning (RL)-driven data augmentation to improve the generalization of vision-action (VA) models for dexterous grasping. While real-to-sim-to-real frameworks, where a few real demonstrations seed large-scale simulated data, have proven effective for VA models, applying them to dexterous settings remains challenging: obtaining stable multi-finger contacts is nontri… ▽ More This work presents reinforcement learning (RL)-driven data augmentation to improve the generalization of vision-action (VA) models for dexterous grasping. While real-to-sim-to-real frameworks, where a few real demonstrations seed large-scale simulated data, have proven effective for VA models, applying them to dexterous settings remains challenging: obtaining stable multi-finger contacts is nontrivial across diverse object shapes. To address this, we leverage RL to generate contact-rich grasping data across varied geometries. In line with the real-to-sim-to-real paradigm, the grasp skill is formulated as a parameterized and tunable reference trajectory refined by a residual policy learned via RL. This modular design enables trajectory-level control that is both consistent with real demonstrations and adaptable to diverse object geometries. A vision-conditioned policy trained on simulation-augmented data demonstrates strong generalization to unseen objects, highlighting the potential of our approach to alleviate the data bottleneck in training VA models. △ Less

Submitted 25 April, 2025; originally announced April 2025.

arXiv:2504.04939 [pdf, other]

A Taxonomy of Self-Handover

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: Self-handover, transferring an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants.… ▽ More Self-handover, transferring an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants. Our analysis reveals that self-handover is not merely a passive transition, but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types using a state-of-the-art vision-language model. These findings offer fresh insights into bimanual coordination, underscoring the role of self-handover in enabling smooth task transitions-an ability essential for adaptive dual-arm robotics. △ Less

Submitted 8 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

Comments: 8 pages, 8 figures, 1 table, Last updated on April 7th, 2025

arXiv:2504.01252 [pdf, other]

Plan-and-Act using Large Language Models for Interactive Agreement

Authors: Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi

Abstract: Recent large language models (LLMs) are capable of planning robot actions. In this paper, we explore how LLMs can be used for planning actions with tasks involving situational human-robot interaction (HRI). A key problem of applying LLMs in situational HRI is balancing between "respecting the current human's activity" and "prioritizing the robot's task," as well as understanding the timing of when… ▽ More Recent large language models (LLMs) are capable of planning robot actions. In this paper, we explore how LLMs can be used for planning actions with tasks involving situational human-robot interaction (HRI). A key problem of applying LLMs in situational HRI is balancing between "respecting the current human's activity" and "prioritizing the robot's task," as well as understanding the timing of when to use the LLM to generate an action plan. In this paper, we propose a necessary plan-and-act skill design to solve the above problems. We show that a critical factor for enabling a robot to switch between passive / active interaction behavior is to provide the LLM with an action text about the current robot's action. We also show that a second-stage question to the LLM (about the next timing to call the LLM) is necessary for planning actions at an appropriate timing. The skill design is applied to an Engage skill and is tested on four distinct interaction scenarios. We show that by using the skill design, LLMs can be leveraged to easily scale to different HRI scenarios with a reasonable success rate reaching 90% on the test scenarios. △ Less

Submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.15491 [pdf, other]

Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models

Authors: Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi

Abstract: In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human is dependent on several situational factors (e.g., the current human's activity, urgency of the interaction, etc.). We test whether large language models (LLM) and vision language models (VLM) can provide solutions to this problem. We compare four different system… ▽ More In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human is dependent on several situational factors (e.g., the current human's activity, urgency of the interaction, etc.). We test whether large language models (LLM) and vision language models (VLM) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and test on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision model indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear, however, challenge remains in the open-ended situations where the model must balance between the human and robot situation. △ Less

Submitted 7 January, 2025; originally announced March 2025.

arXiv:2501.03968 [pdf, other]

VLM-driven Behavior Tree for Context-aware Task Planning

Authors: Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Kazuhiro Sasabuchi, Katsushi Ikeuchi

Abstract: The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex… ▽ More The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations. △ Less

Submitted 10 January, 2025; v1 submitted 7 January, 2025; originally announced January 2025.

Comments: 10 pages, 11 figures, 5 tables. Last updated on January 9th, 2024

arXiv:2412.11337 [pdf, other]

Modality-Driven Design for Multi-Step Dexterous Manipulation: Insights from Neuroscience

Authors: Naoki Wake, Atsushi Kanehira, Daichi Saito, Jun Takamatsu, Kazuhiro Sasabuchi, Hideki Koike, Katsushi Ikeuchi

Abstract: Multi-step dexterous manipulation is a fundamental skill in household scenarios, yet remains an underexplored area in robotics. This paper proposes a modular approach, where each step of the manipulation process is addressed with dedicated policies based on effective modality input, rather than relying on a single end-to-end model. To demonstrate this, a dexterous robotic hand performs a manipulat… ▽ More Multi-step dexterous manipulation is a fundamental skill in household scenarios, yet remains an underexplored area in robotics. This paper proposes a modular approach, where each step of the manipulation process is addressed with dedicated policies based on effective modality input, rather than relying on a single end-to-end model. To demonstrate this, a dexterous robotic hand performs a manipulation task involving picking up and rotating a box. Guided by insights from neuroscience, the task is decomposed into three sub-skills, 1)reaching, 2)grasping and lifting, and 3)in-hand rotation, based on the dominant sensory modalities employed in the human brain. Each sub-skill is addressed using distinct methods from a practical perspective: a classical controller, a Vision-Language-Action model, and a reinforcement learning policy with force feedback, respectively. We tested the pipeline on a real robot to demonstrate the feasibility of our approach. The key contribution of this study lies in presenting a neuroscience-inspired, modality-driven methodology for multi-step dexterous manipulation. △ Less

Submitted 15 December, 2024; originally announced December 2024.

Comments: 8 pages, 5 figures, 2 tables. Last updated on December 14th, 2024

arXiv:2408.17422 [pdf, other]

doi 10.1109/ACCESS.2025.3555167

Open-Vocabulary Action Localization with Iterative Visual Prompting

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs… ▽ More Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/ △ Less

Submitted 7 April, 2025; v1 submitted 30 August, 2024; originally announced August 2024.

Comments: 9 pages, 5 figures, 6 tables. Published in IEEE Access. Last updated on April 7th, 2025

arXiv:2407.11436 [pdf, other]

APriCoT: Action Primitives based on Contact-state Transition for In-Hand Tool Manipulation

Authors: Daichi Saito, Atsushi Kanehira, Kazuhiro Sasabuchi, Naoki Wake, Jun Takamatsu, Hideki Koike, Katsushi Ikeuchi

Abstract: In-hand tool manipulation is an operation that not only manipulates a tool within the hand (i.e., in-hand manipulation) but also achieves a grasp suitable for a task after the manipulation. This study aims to achieve an in-hand tool manipulation skill through deep reinforcement learning. The difficulty of learning the skill arises because this manipulation requires (A) exploring long-term contact-… ▽ More In-hand tool manipulation is an operation that not only manipulates a tool within the hand (i.e., in-hand manipulation) but also achieves a grasp suitable for a task after the manipulation. This study aims to achieve an in-hand tool manipulation skill through deep reinforcement learning. The difficulty of learning the skill arises because this manipulation requires (A) exploring long-term contact-state changes to achieve the desired grasp and (B) highly-varied motions depending on the contact-state transition. (A) leads to a sparsity of a reward on a successful grasp, and (B) requires an RL agent to explore widely within the state-action space to learn highly-varied actions, leading to sample inefficiency. To address these issues, this study proposes Action Primitives based on Contact-state Transition (APriCoT). APriCoT decomposes the manipulation into short-term action primitives by describing the operation as a contact-state transition based on three action representations (detach, crossover, attach). In each action primitive, fingers are required to perform short-term and similar actions. By training a policy for each primitive, we can mitigate the issues from (A) and (B). This study focuses on a fundamental operation as an example of in-hand tool manipulation: rotating an elongated object grasped with a precision grasp by half a turn to achieve the initial grasp. Experimental results demonstrated that ours succeeded in both the rotation and the achievement of the desired grasp, unlike existing studies. Additionally, it was found that the policy was robust to changes in object shape. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2403.02316 [pdf, other]

Designing Library of Skill-Agents for Hardware-Level Reusability

Authors: Jun Takamatsu, Daichi Saito, Katsushi Ikeuchi, Atsushi Kanehira, Kazuhiro Sasabuchi, Naoki Wake

Abstract: To use new robot hardware in a new environment, it is necessary to develop a control program tailored to that specific robot in that environment. Considering the reusability of software among robots is crucial to minimize the effort involved in this process and maximize software reuse across different robots in different environments. This paper proposes a method to remedy this process by consider… ▽ More To use new robot hardware in a new environment, it is necessary to develop a control program tailored to that specific robot in that environment. Considering the reusability of software among robots is crucial to minimize the effort involved in this process and maximize software reuse across different robots in different environments. This paper proposes a method to remedy this process by considering hardware-level reusability, using Learning-from-observation (LfO) paradigm with a pre-designed skill-agent library. The LfO framework represents the required actions in hardware-independent representations, referred to as task models, from observing human demonstrations, capturing the necessary parameters for the interaction between the environment and the robot. When executing the desired actions from the task models, a set of skill agents is employed to convert the representations into robot commands. This paper focuses on the latter part of the LfO framework, utilizing the set to generate robot actions from the task models, and explores a hardware-independent design approach for these skill agents. These skill agents are described in a hardware-independent manner, considering the relative relationship between the robot's hand position and the environment. As a result, it is possible to execute these actions on robots with different hardware configurations by simply swapping the inverse kinematics solver. This paper, first, defines a necessary and sufficient skill-agent set corresponding to cover all possible actions, and considers the design principles for these skill agents in the library. We provide concrete examples of such skill agents and demonstrate the practicality of using these skill agents by showing that the same representations can be executed on two different robots, Nextage and Fetch, using the proposed skill-agents set. △ Less

Submitted 20 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2311.12015 [pdf, other]

doi 10.1109/LRA.2024.3477090

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and… ▽ More We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/ △ Less

Submitted 26 September, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: 8 pages, 10 figures, 3 tables. Published in IEEE Robotics and Automation Letters (RA-L) (in press). Last updated on September 26th, 2024

arXiv:2311.11007 [pdf, other]

Constraint-aware Policy for Compliant Manipulation

Authors: Daichi Saito, Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Hideki Koike, Katsushi Ikeuchi

Abstract: Robot manipulation in a physically-constrained environment requires compliant manipulation. Compliant manipulation is a manipulation skill to adjust hand motion based on the force imposed by the environment. Recently, reinforcement learning (RL) has been applied to solve household operations involving compliant manipulation. However, previous RL methods have primarily focused on designing a policy… ▽ More Robot manipulation in a physically-constrained environment requires compliant manipulation. Compliant manipulation is a manipulation skill to adjust hand motion based on the force imposed by the environment. Recently, reinforcement learning (RL) has been applied to solve household operations involving compliant manipulation. However, previous RL methods have primarily focused on designing a policy for a specific operation that limits their applicability and requires separate training for every new operation. We propose a constraint-aware policy that is applicable to various unseen manipulations by grouping several manipulations together based on the type of physical constraint involved. The type of physical constraint determines the characteristic of the imposed force direction; thus, a generalized policy is trained in the environment and reward designed on the basis of this characteristic. This paper focuses on two types of physical constraints: prismatic and revolute joints. Experiments demonstrated that the same policy could successfully execute various compliant-manipulation operations, both in the simulation and reality. We believe this study is the first step toward realizing a generalized household-robot. △ Less

Submitted 18 November, 2023; originally announced November 2023.

arXiv:2310.11753 [pdf, other]

Bias in Emotion Recognition with ChatGPT

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This technical report explores the ability of ChatGPT in recognizing emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition is not yet explored. Here, we conducted experiments to evaluat… ▽ More This technical report explores the ability of ChatGPT in recognizing emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition is not yet explored. Here, we conducted experiments to evaluate its performance of emotion recognition across different datasets and emotion labels. Our findings indicate a reasonable level of reproducibility in its performance, with noticeable improvement through fine-tuning. However, the performance varies with different emotion labels and datasets, highlighting an inherent instability and possible bias. The choice of dataset and emotion labels significantly impacts ChatGPT's emotion recognition performance. This paper sheds light on the importance of dataset and label selection, and the potential of fine-tuning in enhancing ChatGPT's emotion recognition capabilities, providing a groundwork for better integration of emotion analysis in applications using ChatGPT. △ Less

Submitted 4 December, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: 5 pages, 4 figures, 6 tables

arXiv:2306.01741 [pdf, other]

GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs) such as GPT-3 and ChatGPT. The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech. Our motivation is to explore ways of utilizing the recent progress in LLMs for practical robotic a… ▽ More This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs) such as GPT-3 and ChatGPT. The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech. Our motivation is to explore ways of utilizing the recent progress in LLMs for practical robotic applications, which benefits the development of both chatbots and LLMs. Specifically, it enables the development of highly responsive chatbot systems by leveraging LLMs and adds visual effects to the user interface of LLMs as an additional value. The source code for the system is available on GitHub for our in-house robot (https://github.com/microsoft/LabanotationSuite/tree/master/MSRAbotChatSimulation) and GitHub for Toyota HSR (https://github.com/microsoft/GPT-Enabled-HSR-CoSpeechGestures). △ Less

Submitted 10 May, 2023; originally announced June 2023.

arXiv:2304.09966 [pdf, other]

Applying Learning-from-observation to household service robots: three common-sense formulation

Authors: Katsushi Ikeuchi, Jun Takamatsu, Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehiro

Abstract: Utilizing a robot in a new application requires the robot to be programmed at each time. To reduce such programmings efforts, we have been developing ``Learning-from-observation (LfO)'' that automatically generates robot programs by observing human demonstrations. One of the main issues with introducing this LfO system into the domain of household tasks is the cluttered environments, which cause d… ▽ More Utilizing a robot in a new application requires the robot to be programmed at each time. To reduce such programmings efforts, we have been developing ``Learning-from-observation (LfO)'' that automatically generates robot programs by observing human demonstrations. One of the main issues with introducing this LfO system into the domain of household tasks is the cluttered environments, which cause difficulty in determining which elements are important for task execution when observing demonstrations. To overcome this issue, it is necessary for the system to have common sense shared with the human demonstrator. This paper addresses three relationships that LfO in the household domain should focus on when observing demonstrations and proposes representations to describe the common sense used by the demonstrator for optimal execution of task sequences. Specifically, the paper proposes to use labanotation to describe the postures between the environment and the robot, contact-webs to describe the grasping methods between the robot and the tool, and physical and semantic constraints to describe the motions between the tool and the environment. Then, based on these representations, the paper formulates task models, machine-independent robot programs, that indicate what to do and how to do. Third, the paper explains the task encoder to obtain task models and task decoder to execute the task models on the robot hardware. Finally, this paper presents how the system actually works through several example scenes. △ Less

Submitted 19 April, 2023; originally announced April 2023.

arXiv:2304.03893 [pdf, other]

doi 10.1109/ACCESS.2023.3310935

ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. The paper proposes easy-to-customize input prompts for ChatGPT that meet common requirements in practical applications, such as easy integration with robot execution systems and applicability to various environments while minimizing th… ▽ More This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. The paper proposes easy-to-customize input prompts for ChatGPT that meet common requirements in practical applications, such as easy integration with robot execution systems and applicability to various environments while minimizing the impact of ChatGPT's token limit. The prompts encourage ChatGPT to output a sequence of predefined robot actions, represent the operating environment in a formalized style, and infer the updated state of the operating environment. Experiments confirmed that the proposed prompts enable ChatGPT to act according to requirements in various environments, and users can adjust ChatGPT's output with natural language feedback for safe and robust operation. The proposed prompts and source code are open-source and publicly available at https://github.com/microsoft/ChatGPT-Robot-Manipulation-Prompts △ Less

Submitted 29 August, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: 21 figures, 7 tables. Published in IEEE Access (in press). Last updated August 29th, 2023

arXiv:2301.01382 [pdf, other]

Task-sequencing Simulator: Integrated Machine Learning to Execution Simulation for Robot Manipulation

Authors: Kazuhiro Sasabuchi, Daichi Saito, Atsushi Kanehira, Naoki Wake, Jun Takamatsu, Katsushi Ikeuchi

Abstract: A task-sequencing simulator in robotics manipulation to integrate simulation-for-learning and simulation-for-execution is introduced. Unlike existing machine-learning simulation where a non-decomposed simulation is used to simulate a training scenario, the task-sequencing simulator runs a composed simulation using building blocks. This way, the simulation-for-learning is structured similarly to a… ▽ More A task-sequencing simulator in robotics manipulation to integrate simulation-for-learning and simulation-for-execution is introduced. Unlike existing machine-learning simulation where a non-decomposed simulation is used to simulate a training scenario, the task-sequencing simulator runs a composed simulation using building blocks. This way, the simulation-for-learning is structured similarly to a multi-step simulation-for-execution. To compose both learning and execution scenarios, a unified trainable-and-composable description of blocks called a concept model is proposed and used. Using the simulator design and concept models, a reusable simulator for learning different tasks, a common-ground system for learning-to-execution, simulation-to-real is achieved and shown. △ Less

Submitted 3 January, 2023; originally announced January 2023.

Comments: 7 pages, 6 figures

arXiv:2212.10787 [pdf, other]

doi 10.1109/AIM46323.2023.10196126

Interactive Task Encoding System for Learning-from-Observation

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: We present the Interactive Task Encoding System (ITES) for teaching robots to perform manipulative tasks. ITES is designed as an input system for the Learning-from-Observation (LfO) framework, which enables household robots to be programmed using few-shot human demonstrations without the need for coding. In contrast to previous LfO systems that rely solely on visual demonstrations, ITES leverages… ▽ More We present the Interactive Task Encoding System (ITES) for teaching robots to perform manipulative tasks. ITES is designed as an input system for the Learning-from-Observation (LfO) framework, which enables household robots to be programmed using few-shot human demonstrations without the need for coding. In contrast to previous LfO systems that rely solely on visual demonstrations, ITES leverages both verbal instructions and interaction to enhance recognition robustness, thus enabling multimodal LfO. ITES identifies tasks from verbal instructions and extracts parameters from visual demonstrations. Meanwhile, the recognition result was reviewed by the user for interactive correction. Evaluations conducted on a real robot demonstrate the successful teaching of multiple operations for several scenarios, suggesting the usefulness of ITES for multimodal LfO. The source code is available at https://github.com/microsoft/symbolic-robot-teaching-interface. △ Less

Submitted 28 April, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: 6 pages, 9 figures. Submitted to and accepted by 2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM). Last updated April 28st, 2023

arXiv:2212.09242 [pdf, other]

Learning-from-Observation System Considering Hardware-Level Reusability

Authors: Jun Takamatsu, Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Katsushi Ikeuchi

Abstract: Robot developers develop various types of robots for satisfying users' various demands. Users' demands are related to their backgrounds and robots suitable for users may vary. If a certain developer would offer a robot that is different from the usual to a user, the robot-specific software has to be changed. On the other hand, robot-software developers would like to reuse their developed software… ▽ More Robot developers develop various types of robots for satisfying users' various demands. Users' demands are related to their backgrounds and robots suitable for users may vary. If a certain developer would offer a robot that is different from the usual to a user, the robot-specific software has to be changed. On the other hand, robot-software developers would like to reuse their developed software as much as possible to reduce their efforts. We propose the system design considering hardware-level reusability. For this purpose, we begin with the learning-from-observation framework. This framework represents a target task in robot-agnostic representation, and thus the represented task description can be shared with various robots. When executing the task, it is necessary to convert the robot-agnostic description into commands of a target robot. To increase the reusability, first, we implement the skill library, robot motion primitives, only considering a robot hand and we regarded that a robot was just a carrier to move the hand on the target trajectory. The skill library is reusable if we would like to the same robot hand. Second, we employ the generic IK solver to quickly swap a robot. We verify the hardware-level reusability by applying two task descriptions to two different robots, Nextage and Fetch. △ Less

Submitted 18 December, 2022; originally announced December 2022.

Comments: 5 pages, 4 figures

arXiv:2203.15290 [pdf, other]

Design strategies for controlling neuron-connected robots using reinforcement learning

Authors: Haruto Sawada, Naoki Wake, Kazuhiro Sasabuchi, Jun Takamatsu, Hirokazu Takahashi, Katsushi Ikeuchi

Abstract: Despite the growing interest in robot control utilizing the computation of biological neurons, context-dependent behavior by neuron-connected robots remains a challenge. Context-dependent behavior here is defined as behavior that is not the result of a simple sensory-motor coupling, but rather based on an understanding of the task goal. This paper proposes design principles for training neuron-con… ▽ More Despite the growing interest in robot control utilizing the computation of biological neurons, context-dependent behavior by neuron-connected robots remains a challenge. Context-dependent behavior here is defined as behavior that is not the result of a simple sensory-motor coupling, but rather based on an understanding of the task goal. This paper proposes design principles for training neuron-connected robots based on task goals to achieve context-dependent behavior. First, we employ deep reinforcement learning (RL) to enable training that accounts for goal achievements. Second, we propose a neuron simulator as a probability distribution based on recorded neural data, aiming to represent physiologically valid neural dynamics while avoiding complex modeling with high computational costs. Furthermore, we propose to update the simulators during the training to bridge the gap between the simulation and the real settings. The experiments showed that the robot gradually learned context-dependent behaviors in pole balancing and robot navigation tasks. Moreover, the learned policies were valid for neural simulators based on novel neural data, and the task performance increased by updating the simulators during training. These results suggest the effectiveness of the proposed design principle for the context-dependent behavior of neuron-connected robots. △ Less

Submitted 29 March, 2022; originally announced March 2022.

Comments: Last updated March 29th, 2022

arXiv:2203.00733 [pdf, other]

Task-grasping from human demonstration

Authors: Daichi Saito, Kazuhiro Sasabuchi, Naoki Wake, Jun Takamatsu, Hideki Koike, Katsushi Ikeuchi

Abstract: A challenge in robot grasping is to achieve task-grasping which is to select a grasp that is advantageous to the success of tasks before and after grasps. One of the frameworks to address this difficulty is Learning-from-Observation (LfO), which obtains various hints from human demonstrations. This paper solves three issues in the grasping skills in the LfO framework: 1) how to functionally mimic… ▽ More A challenge in robot grasping is to achieve task-grasping which is to select a grasp that is advantageous to the success of tasks before and after grasps. One of the frameworks to address this difficulty is Learning-from-Observation (LfO), which obtains various hints from human demonstrations. This paper solves three issues in the grasping skills in the LfO framework: 1) how to functionally mimic human-demonstrated grasps to robots with limited grasp capability, 2) how to coordinate grasp skills with reaching body mimicking, 3) how to robustly perform grasps under object pose and shape uncertainty. A deep reinforcement learning using contact-web based rewards and domain randomization of approach directions is proposed to achieve such robust mimicked grasping skills. Experiment results show that the trained grasping skills can be applied in an LfO system and executed on a real robot. In addition, it is shown that the trained skill is robust to errors in the object pose and to the uncertainty of the object shape and can be combined with various reach-coordination. △ Less

Submitted 1 March, 2022; originally announced March 2022.

Comments: 7 pages, 8 figures

arXiv:2103.02201 [pdf, other]

Semantic constraints to represent common sense required in household actions for multi-modal Learning-from-observation robot

Authors: Katsushi Ikeuchi, Naoki Wake, Riku Arakawa, Kazuhiro Sasabuchi, Jun Takamatsu

Abstract: The paradigm of learning-from-observation (LfO) enables a robot to learn how to perform actions by observing human-demonstrated actions. Previous research in LfO have mainly focused on the industrial domain which only consist of the observable physical constraints between a manipulating tool and the robot's working environment. In order to extend this paradigm to the household domain which consist… ▽ More The paradigm of learning-from-observation (LfO) enables a robot to learn how to perform actions by observing human-demonstrated actions. Previous research in LfO have mainly focused on the industrial domain which only consist of the observable physical constraints between a manipulating tool and the robot's working environment. In order to extend this paradigm to the household domain which consists non-observable constraints derived from a human's common sense; we introduce the idea of semantic constraints. The semantic constraints are represented similar to the physical constraints by defining a contact with an imaginary semantic environment. We thoroughly investigate the necessary and sufficient set of contact state and state transitions to understand the different types of physical and semantic constraints. We then apply our constraint representation to analyze various actions in top hit household YouTube videos and real home cooking recordings. We further categorize the frequently appearing constraint patterns into physical, semantic, and multistage task groups and verify that these groups are not only necessary but a sufficient set for covering standard household actions. Finally, we conduct a preliminary experiment using textual input to explore the possibilities of combining verbal and visual input for recognizing the task groups. Our results provide promising directions for incorporating common sense in the literature of robot teaching. △ Less

Submitted 3 March, 2021; originally announced March 2021.

Comments: 18 pages, 31 figures

arXiv:2103.00268 [pdf, other]

doi 10.1007/s00138-023-01408-z

Text-driven object affordance for guiding grasp-type recognition in multimodal robot teaching

Authors: Naoki Wake, Daichi Saito, Kazuhiro Sasabuchi, Hideki Koike, Katsushi Ikeuchi

Abstract: This study investigates how text-driven object affordance, which provides prior knowledge about grasp types for each object, affects image-based grasp-type recognition in robot teaching. The researchers created labeled datasets of first-person hand images to examine the impact of object affordance on recognition performance. They evaluated scenarios with real and illusory objects, considering mixe… ▽ More This study investigates how text-driven object affordance, which provides prior knowledge about grasp types for each object, affects image-based grasp-type recognition in robot teaching. The researchers created labeled datasets of first-person hand images to examine the impact of object affordance on recognition performance. They evaluated scenarios with real and illusory objects, considering mixed reality teaching conditions where visual object information may be limited. The results demonstrate that object affordance improves image-based recognition by filtering out unlikely grasp types and emphasizing likely ones. The effectiveness of object affordance was more pronounced when there was a stronger bias towards specific grasp types for each object. These findings highlight the significance of object affordance in multimodal robot teaching, regardless of whether real objects are present in the images. Sample code is available on https://github.com/microsoft/arr-grasp-type-recognition. △ Less

Submitted 12 May, 2023; v1 submitted 27 February, 2021; originally announced March 2021.

Comments: 8 pages, 11 figures. Last updated March 12, 2023 Accepted for publication in Machine Vision and Applications

arXiv:2101.05061 [pdf, other]

Understanding Action Sequences based on Video Captioning for Learning-from-Observation

Authors: Iori Yanokura, Naoki Wake, Kazuhiro Sasabuchi, Katsushi Ikeuchi, Masayuki Inaba

Abstract: Learning actions from human demonstration video is promising for intelligent robotic systems. Extracting the exact section and re-observing the extracted video section in detail is important for imitating complex skills because human motions give valuable hints for robots. However, the general video understanding methods focus more on the understanding of the full frame,lacking consideration on ex… ▽ More Learning actions from human demonstration video is promising for intelligent robotic systems. Extracting the exact section and re-observing the extracted video section in detail is important for imitating complex skills because human motions give valuable hints for robots. However, the general video understanding methods focus more on the understanding of the full frame,lacking consideration on extracting accurate sections and aligning them with the human's intent. We propose a Learning-from-Observation framework that splits and understands a video of a human demonstration with verbal instructions to extract accurate action sequences. The splitting is done based on local minimum points of the hand velocity, which align human daily-life actions with object-centered face contact transitions required for generating robot motion. Then, we extract a motion description on the split videos using video captioning techniques that are trained from our new daily-life action video dataset. Finally, we match the motion descriptions with the verbal instructions to understand the correct human intent and ignore the unintended actions inside the video. We evaluate the validity of hand velocity-based video splitting and demonstrate that it is effective. The experimental results on our new video captioning dataset focusing on daily-life human actions demonstrate the effectiveness of the proposed method. The source code, trained models, and the dataset will be made available. △ Less

Submitted 9 December, 2020; originally announced January 2021.

arXiv:2010.06194 [pdf]

Labeling the Phrases of a Conversational Agent with a Unique Personalized Vocabulary

Authors: Naoki Wake, Machiko Sato, Kazuhiro Sasabuchi, Minako Nakamura, Katsushi Ikeuchi

Abstract: Mapping spoken text to gestures is an important research topic for robots with conversation capabilities. According to studies on human co-speech gestures, a reasonable solution for mapping is using a concept-based approach in which a text is first mapped to a semantic cluster (i.e., a concept) containing texts with similar meanings. Subsequently, each concept is mapped to a predefined gesture. By… ▽ More Mapping spoken text to gestures is an important research topic for robots with conversation capabilities. According to studies on human co-speech gestures, a reasonable solution for mapping is using a concept-based approach in which a text is first mapped to a semantic cluster (i.e., a concept) containing texts with similar meanings. Subsequently, each concept is mapped to a predefined gesture. By using a concept-based approach, this paper discusses the practical issue of obtaining concepts for a unique vocabulary personalized for a conversational agent. Using Microsoft Rinna as an agent, we qualitatively compare concepts obtained automatically through a natural language processing (NLP) approach to those obtained manually through a sociological approach. We then identify three limitations of the NLP approach: at the semantic level with emojis and symbols; at the semantic level with slang, new words, and buzzwords; and at the pragmatic level. We attribute these limitations to the personalized vocabulary of Rinna. A follow-up experiment demonstrates that robot gestures selected using a concept-based approach leave a better impression than randomly selected gestures for the Rinna vocabulary, suggesting the usefulness of a concept-based gesture generation system for personalized vocabularies. This study provides insights into the development of gesture generation systems for conversational agents with personalized vocabularies. △ Less

Submitted 12 November, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: 8 pages, 3 figures. Submitted to and accepted by IEEE/SICE SII 2022. Last updated November 12th, 2021

arXiv:2009.09813 [pdf]

Grasp-type Recognition Leveraging Object Affordance

Authors: Naoki Wake, Kazuhiro Sasabuchi, Katsushi Ikeuchi

Abstract: A key challenge in robot teaching is grasp-type recognition with a single RGB image and a target object name. Here, we propose a simple yet effective pipeline to enhance learning-based recognition by leveraging a prior distribution of grasp types for each object. In the pipeline, a convolutional neural network (CNN) recognizes the grasp type from an RGB image. The recognition result is further cor… ▽ More A key challenge in robot teaching is grasp-type recognition with a single RGB image and a target object name. Here, we propose a simple yet effective pipeline to enhance learning-based recognition by leveraging a prior distribution of grasp types for each object. In the pipeline, a convolutional neural network (CNN) recognizes the grasp type from an RGB image. The recognition result is further corrected using the prior distribution (i.e., affordance), which is associated with the target object name. Experimental results showed that the proposed method outperforms both a CNN-only and an affordance-only method. The results highlight the effectiveness of linguistically-driven object affordance for enhancing grasp-type recognition in robot teaching. △ Less

Submitted 26 August, 2020; originally announced September 2020.

Comments: 2 pages, 2 figures. Submitted to and accepted by HOBI (IEEE RO-MAN Workshop 2020). Last updated August 26th, 2020

arXiv:2008.01513 [pdf]

doi 10.1109/IEEECONF49454.2021.9382750

A Learning-from-Observation Framework: One-Shot Robot Teaching for Grasp-Manipulation-Release Household Operations

Authors: Naoki Wake, Riku Arakawa, Iori Yanokura, Takuya Kiyokawa, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: A household robot is expected to perform various manipulative operations with an understanding of the purpose of the task. To this end, a desirable robotic application should provide an on-site robot teaching framework for non-experts. Here we propose a Learning-from-Observation (LfO) framework for grasp-manipulation-release class household operations (GMR-operations). The framework maps human dem… ▽ More A household robot is expected to perform various manipulative operations with an understanding of the purpose of the task. To this end, a desirable robotic application should provide an on-site robot teaching framework for non-experts. Here we propose a Learning-from-Observation (LfO) framework for grasp-manipulation-release class household operations (GMR-operations). The framework maps human demonstrations to predefined task models through one-shot teaching. Each task model contains both high-level knowledge regarding the geometric constraints and low-level knowledge related to human postures. The key idea is to design a task model that 1) covers various GMR-operations and 2) includes human postures to achieve tasks. We verify the applicability of our framework by testing an operational LfO system with a real robot. In addition, we quantify the coverage of the task model by analyzing online videos of household operations. In the context of one-shot robot teaching, the contribution of this study is a framework that 1) covers various GMR-operations and 2) mimics human postures during the operations. △ Less

Submitted 20 October, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: 6 pages, 6 figures. Submitted to and accepted by IEEE/SICE SII 2021. Last updated October 20th, 2020

arXiv:2007.08750 [pdf, other]

doi 10.1109/LRA.2020.3044029

Task-oriented Motion Mapping on Robots of Various Configuration using Body Role Division

Authors: Kazuhiro Sasabuchi, Naoki Wake, Katsushi Ikeuchi

Abstract: Many works in robot teaching either focus only on teaching task knowledge, such as geometric constraints, or motion knowledge, such as the motion for accomplishing a task. However, to effectively teach a complex task sequence to a robot, it is important to take advantage of both task and motion knowledge. The task knowledge provides the goals of each individual task within the sequence and reduces… ▽ More Many works in robot teaching either focus only on teaching task knowledge, such as geometric constraints, or motion knowledge, such as the motion for accomplishing a task. However, to effectively teach a complex task sequence to a robot, it is important to take advantage of both task and motion knowledge. The task knowledge provides the goals of each individual task within the sequence and reduces the number of required human demonstrations, whereas the motion knowledge contain the task-to-task constraints that would otherwise require expert knowledge to model the problem. In this paper, we propose a body role division approach that combines both types of knowledge using a single human demonstration. The method is inspired by facts on human body motion and uses a body structural analogy to decompose a robot's body configuration into different roles: body parts that are dominant for imitating the human motion and body parts that are substitutional for adjusting the imitation with respect to the task knowledge. Our results show that our method scales to robots of different number of arm links, guides a robot's configuration to one that achieves an upcoming task, and is potentially beneficial for teaching a range of task sequences. △ Less

Submitted 30 December, 2020; v1 submitted 17 July, 2020; originally announced July 2020.

Comments: 8 pages, 10 figures

arXiv:2007.08705 [pdf]

Verbal Focus-of-Attention System for Learning-from-Observation

Authors: Naoki Wake, Iori Yanokura, Kazuhiro Sasabuchi, Katsushi Ikeuchi

Abstract: The learning-from-observation (LfO) framework aims to map human demonstrations to a robot to reduce programming effort. To this end, an LfO system encodes a human demonstration into a series of execution units for a robot, which are referred to as task models. Although previous research has proposed successful task-model encoders, there has been little discussion on how to guide a task-model encod… ▽ More The learning-from-observation (LfO) framework aims to map human demonstrations to a robot to reduce programming effort. To this end, an LfO system encodes a human demonstration into a series of execution units for a robot, which are referred to as task models. Although previous research has proposed successful task-model encoders, there has been little discussion on how to guide a task-model encoder in a scene with spatio-temporal noises, such as cluttered objects or unrelated human body movements. Inspired by the function of verbal instructions guiding an observer's visual attention, we propose a verbal focus-of-attention (FoA) system (i.e., spatio-temporal filters) to guide a task-model encoder. For object manipulation, the system first recognizes the name of a target object and its attributes from verbal instructions. The information serves as a where-to-look FoA filter to confine the areas in which the target object existed in the demonstration. The system then detects the timings of grasp and release that occurred in the filtered areas. The timings serve as a when-to-look FoA filter to confine the period of object manipulation. Finally, a task-model encoder recognizes the task models by employing FoA filters. We demonstrate the robustness of the verbal FoA in attenuating spatio-temporal noises by comparing it with an existing action localization network. The contributions of this study are as follows: (1) to propose a verbal FoA for LfO, (2) to design an algorithm to calculate FoA filters from verbal input, and (3) to demonstrate the effectiveness of a verbal FoA in localizing an action by comparing it with a state-of-the-art vision system. △ Less

Submitted 24 March, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: 8 pages, 7 figures. Submitted to and accepted by IEEE ICRA 2021. Last updated March 3rd, 2021

Showing 1–29 of 29 results for author: Sasabuchi, K