-
IK Seed Generator for Dual-Arm Human-like Physicality Robot with Mobile Base
Authors:
Jun Takamatsu,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Naoki Wake,
Katsushi Ikeuchi
Abstract:
Robots are widely expected to take over tasks from humans, and a robot with human-like physicality is more likely to be able to do so. Household service robots in particular should be of human-like size so that they can coexist with humans in their operating environment without becoming excessively large. However, robots with size limitations tend to have difficulty solving inverse kinematics (IK) due to mechanical limitations, such as joint angle limits. Conversely, if this difficulty could be mitigated, such robots would become more valuable. In numerical IK solvers, which are commonly used for robots with higher degrees of freedom (DOF), whether IK can be solved depends on the initial guess given to the solver. Thus, this paper proposes a method for generating a good initial guess for a numerical IK solver given the target hand configuration. For this purpose, we define the goodness of an initial guess using the scaled Jacobian matrix, which yields a manipulability index that accounts for the joint limits. Both manipulability and joint limits are related to the difficulty of solving IK. We generate the initial guess by optimizing the goodness using a genetic algorithm (GA). To enumerate as many IK solutions as possible, we use the reachability map, which represents the reachable area of the robot hand in the arm-base coordinate system. We conduct a quantitative evaluation and show that using an initial guess judged to be better by the goodness value increases the probability that IK is solved. Finally, as an application of the proposed method, we show that, by generating good initial guesses for IK, a robot actually accomplishes three typical scenarios.
Submitted 1 May, 2025;
originally announced May 2025.
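The idea in the abstract above of scoring an initial guess by a manipulability index that accounts for joint limits can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's actual formulation: the planar two-link arm, the quadratic `limit_weight` penalty, and the function names are all hypothetical. The sketch scales each Jacobian column by a weight that falls to zero at the joint limits and then takes the Yoshikawa manipulability of the scaled matrix.

```python
import math
from typing import List, Sequence, Tuple


def planar_jacobian(q: Sequence[float],
                    lengths: Tuple[float, float] = (1.0, 1.0)) -> List[List[float]]:
    """2x2 position Jacobian of a planar two-link arm (rows: x, y)."""
    l1, l2 = lengths
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def limit_weight(angle: float, lower: float, upper: float) -> float:
    """Assumed penalty: 1 at mid-range, falling to 0 at either joint limit."""
    if not lower <= angle <= upper:
        return 0.0
    return 4.0 * (angle - lower) * (upper - angle) / (upper - lower) ** 2


def scaled_manipulability(q: Sequence[float],
                          limits: Sequence[Tuple[float, float]],
                          lengths: Tuple[float, float] = (1.0, 1.0)) -> float:
    """Manipulability of the Jacobian after scaling each column by its joint weight."""
    jac = planar_jacobian(q, lengths)
    weights = [limit_weight(a, lo, hi) for a, (lo, hi) in zip(q, limits)]
    scaled = [[jac[r][c] * weights[c] for c in range(2)] for r in range(2)]
    # For a square Jacobian, sqrt(det(J J^T)) equals |det J|.
    return abs(scaled[0][0] * scaled[1][1] - scaled[0][1] * scaled[1][0])
```

Under these assumptions, a candidate seed with a higher score is both far from singular configurations and far from its joint limits, which is the kind of quantity a search procedure such as a genetic algorithm could maximize.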
-
RL-Driven Data Generation for Robust Vision-Based Dexterous Grasping
Authors:
Atsushi Kanehira,
Naoki Wake,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
This work presents reinforcement learning (RL)-driven data augmentation to improve the generalization of vision-action (VA) models for dexterous grasping. While real-to-sim-to-real frameworks, where a few real demonstrations seed large-scale simulated data, have proven effective for VA models, applying them to dexterous settings remains challenging: obtaining stable multi-finger contacts is nontrivial across diverse object shapes. To address this, we leverage RL to generate contact-rich grasping data across varied geometries. In line with the real-to-sim-to-real paradigm, the grasp skill is formulated as a parameterized and tunable reference trajectory refined by a residual policy learned via RL. This modular design enables trajectory-level control that is both consistent with real demonstrations and adaptable to diverse object geometries. A vision-conditioned policy trained on simulation-augmented data demonstrates strong generalization to unseen objects, highlighting the potential of our approach to alleviate the data bottleneck in training VA models.
Submitted 25 April, 2025;
originally announced April 2025.
-
A Taxonomy of Self-Handover
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
Self-handover, transferring an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants. Our analysis reveals that self-handover is not merely a passive transition, but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types using a state-of-the-art vision-language model. These findings offer fresh insights into bimanual coordination, underscoring the role of self-handover in enabling smooth task transitions, an ability essential for adaptive dual-arm robotics.
Submitted 8 April, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Plan-and-Act using Large Language Models for Interactive Agreement
Authors:
Kazuhiro Sasabuchi,
Naoki Wake,
Atsushi Kanehira,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
Recent large language models (LLMs) are capable of planning robot actions. In this paper, we explore how LLMs can be used for planning actions with tasks involving situational human-robot interaction (HRI). A key problem of applying LLMs in situational HRI is balancing between "respecting the current human's activity" and "prioritizing the robot's task," as well as understanding the timing of when to use the LLM to generate an action plan. In this paper, we propose a necessary plan-and-act skill design to solve the above problems. We show that a critical factor for enabling a robot to switch between passive / active interaction behavior is to provide the LLM with an action text about the current robot's action. We also show that a second-stage question to the LLM (about the next timing to call the LLM) is necessary for planning actions at an appropriate timing. The skill design is applied to an Engage skill and is tested on four distinct interaction scenarios. We show that by using the skill design, LLMs can be leveraged to easily scale to different HRI scenarios with a reasonable success rate reaching 90% on the test scenarios.
Submitted 1 April, 2025;
originally announced April 2025.
-
Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models
Authors:
Kazuhiro Sasabuchi,
Naoki Wake,
Atsushi Kanehira,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human depends on several situational factors (e.g., the human's current activity and the urgency of the interaction). We test whether large language models (LLMs) and vision language models (VLMs) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision models indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear; however, challenges remain in open-ended situations where the model must balance the human's situation against the robot's.
Submitted 7 January, 2025;
originally announced March 2025.
-
VLM-driven Behavior Tree for Context-aware Task Planning
Authors:
Naoki Wake,
Atsushi Kanehira,
Jun Takamatsu,
Kazuhiro Sasabuchi,
Katsushi Ikeuchi
Abstract:
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
Submitted 10 January, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
Modality-Driven Design for Multi-Step Dexterous Manipulation: Insights from Neuroscience
Authors:
Naoki Wake,
Atsushi Kanehira,
Daichi Saito,
Jun Takamatsu,
Kazuhiro Sasabuchi,
Hideki Koike,
Katsushi Ikeuchi
Abstract:
Multi-step dexterous manipulation is a fundamental skill in household scenarios, yet remains an underexplored area in robotics. This paper proposes a modular approach, where each step of the manipulation process is addressed with dedicated policies based on effective modality input, rather than relying on a single end-to-end model. To demonstrate this, a dexterous robotic hand performs a manipulation task involving picking up and rotating a box. Guided by insights from neuroscience, the task is decomposed into three sub-skills, 1) reaching, 2) grasping and lifting, and 3) in-hand rotation, based on the dominant sensory modalities employed in the human brain. Each sub-skill is addressed using distinct methods from a practical perspective: a classical controller, a Vision-Language-Action model, and a reinforcement learning policy with force feedback, respectively. We tested the pipeline on a real robot to demonstrate the feasibility of our approach. The key contribution of this study lies in presenting a neuroscience-inspired, modality-driven methodology for multi-step dexterous manipulation.
Submitted 15 December, 2024;
originally announced December 2024.
-
Open-Vocabulary Action Localization with Iterative Visual Prompting
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/
Submitted 7 April, 2025; v1 submitted 30 August, 2024;
originally announced August 2024.
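The iterative narrowing described in the abstract above can be sketched as a coarse-to-fine search. This is an illustrative sketch under stated assumptions, not the released sample code: the `choose` callable is a hypothetical stand-in for the VLM query (in the paper, the model is shown a concatenated image of labeled frames), and the parameter names and default values are assumptions.

```python
from typing import Callable, List


def sample_indices(start: float, end: float, count: int) -> List[int]:
    """Evenly spaced frame indices across [start, end], both ends included."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (end - start) / (count - 1)
    return [round(start + i * step) for i in range(count)]


def localize_frame(choose: Callable[[List[int]], int],
                   num_frames: int,
                   samples: int = 8,
                   rounds: int = 4,
                   shrink: float = 0.5) -> int:
    """Repeatedly sample frames, let `choose` pick one, and narrow the window around it.

    `choose` is a hypothetical placeholder for the VLM: it receives the sampled
    frame indices and returns the index it judges closest to the action boundary.
    """
    center = (num_frames - 1) / 2
    half_width = (num_frames - 1) / 2
    best = round(center)
    for _ in range(rounds):
        # Clamp the sampling window to the valid frame range.
        start = max(0.0, center - half_width)
        end = min(float(num_frames - 1), center + half_width)
        best = choose(sample_indices(start, end, samples))
        # Re-center on the selected frame and shrink the window for the next round.
        center = float(best)
        half_width *= shrink
    return best
```

In this sketch, one call would estimate the start of the action and a second call with a different `choose` would estimate the end. With 8 samples per round, halving the window keeps the selected frame's neighborhood inside the next window, so each round refines the previous estimate.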
-
APriCoT: Action Primitives based on Contact-state Transition for In-Hand Tool Manipulation
Authors:
Daichi Saito,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Naoki Wake,
Jun Takamatsu,
Hideki Koike,
Katsushi Ikeuchi
Abstract:
In-hand tool manipulation is an operation that not only manipulates a tool within the hand (i.e., in-hand manipulation) but also achieves a grasp suitable for a task after the manipulation. This study aims to achieve an in-hand tool manipulation skill through deep reinforcement learning. The difficulty of learning the skill arises because this manipulation requires (A) exploring long-term contact-state changes to achieve the desired grasp and (B) highly varied motions depending on the contact-state transition. (A) leads to sparse rewards for a successful grasp, and (B) requires an RL agent to explore widely within the state-action space to learn highly varied actions, leading to sample inefficiency. To address these issues, this study proposes Action Primitives based on Contact-state Transition (APriCoT). APriCoT decomposes the manipulation into short-term action primitives by describing the operation as a contact-state transition based on three action representations (detach, crossover, attach). In each action primitive, fingers are required to perform short-term and similar actions. By training a policy for each primitive, we can mitigate the issues from (A) and (B). This study focuses on a fundamental operation as an example of in-hand tool manipulation: rotating an elongated object grasped with a precision grasp by half a turn to achieve the initial grasp. Experimental results demonstrated that our method succeeded in both the rotation and the achievement of the desired grasp, unlike existing studies. Additionally, it was found that the policy was robust to changes in object shape.
Submitted 16 July, 2024;
originally announced July 2024.
-
Robotic Stroke Motion Following the Shape of the Human Back: Motion Generation and Psychological Effects
Authors:
Akishige Yuguchi,
Tomoki Ishikura,
Sung-Gwi Cho,
Jun Takamatsu,
Tsukasa Ogasawara
Abstract:
In this study, we propose a trajectory generation method for a robotic stroke motion that follows the shape of the human back, similar to stroke motions performed by humans and in contrast to the conventional robotic stroke motion with a linear trajectory. We confirmed that the accuracy of the method's trajectory was close to that of the actual stroking motion by a human. Furthermore, we conducted a subjective experiment to evaluate the psychological effects of the proposed stroke motion in contrast to those of the conventional stroke motion with a linear trajectory. The experimental results showed that the actual stroke motion following the shape of the human back tended to evoke more pleasant and active feelings than the conventional stroke motion.
Submitted 10 May, 2024;
originally announced May 2024.
-
Designing Library of Skill-Agents for Hardware-Level Reusability
Authors:
Jun Takamatsu,
Daichi Saito,
Katsushi Ikeuchi,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Naoki Wake
Abstract:
To use new robot hardware in a new environment, it is necessary to develop a control program tailored to that specific robot in that environment. Considering the reusability of software among robots is crucial to minimize the effort involved in this process and maximize software reuse across different robots in different environments. This paper proposes a method to remedy this process by considering hardware-level reusability, using the Learning-from-observation (LfO) paradigm with a pre-designed skill-agent library. The LfO framework represents the required actions in hardware-independent representations, referred to as task models, from observing human demonstrations, capturing the necessary parameters for the interaction between the environment and the robot. When executing the desired actions from the task models, a set of skill agents is employed to convert the representations into robot commands. This paper focuses on the latter part of the LfO framework, utilizing the set to generate robot actions from the task models, and explores a hardware-independent design approach for these skill agents. These skill agents are described in a hardware-independent manner, considering the relative relationship between the robot's hand position and the environment. As a result, it is possible to execute these actions on robots with different hardware configurations by simply swapping the inverse kinematics solver. This paper first defines a necessary and sufficient skill-agent set that covers all possible actions, and considers the design principles for these skill agents in the library. We provide concrete examples of such skill agents and demonstrate the practicality of using these skill agents by showing that the same representations can be executed on two different robots, Nextage and Fetch, using the proposed skill-agent set.
Submitted 20 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
Submitted 26 September, 2024; v1 submitted 20 November, 2023;
originally announced November 2023.
-
Constraint-aware Policy for Compliant Manipulation
Authors:
Daichi Saito,
Kazuhiro Sasabuchi,
Naoki Wake,
Atsushi Kanehira,
Jun Takamatsu,
Hideki Koike,
Katsushi Ikeuchi
Abstract:
Robot manipulation in a physically constrained environment requires compliant manipulation. Compliant manipulation is a manipulation skill to adjust hand motion based on the force imposed by the environment. Recently, reinforcement learning (RL) has been applied to solve household operations involving compliant manipulation. However, previous RL methods have primarily focused on designing a policy for a specific operation, which limits their applicability and requires separate training for every new operation. We propose a constraint-aware policy that is applicable to various unseen manipulations by grouping several manipulations together based on the type of physical constraint involved. The type of physical constraint determines the characteristic of the imposed force direction; thus, a generalized policy is trained with an environment and reward designed on the basis of this characteristic. This paper focuses on two types of physical constraints: prismatic and revolute joints. Experiments demonstrated that the same policy could successfully execute various compliant-manipulation operations, both in simulation and in reality. We believe this study is the first step toward realizing a generalized household robot.
Submitted 18 November, 2023;
originally announced November 2023.
-
Bias in Emotion Recognition with ChatGPT
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
This technical report explores the ability of ChatGPT to recognize emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition has not yet been explored. Here, we conducted experiments to evaluate its emotion recognition performance across different datasets and emotion labels. Our findings indicate a reasonable level of reproducibility in its performance, with noticeable improvement through fine-tuning. However, the performance varies with different emotion labels and datasets, highlighting an inherent instability and possible bias. The choice of dataset and emotion labels significantly impacts ChatGPT's emotion recognition performance. This paper sheds light on the importance of dataset and label selection, and the potential of fine-tuning in enhancing ChatGPT's emotion recognition capabilities, providing a groundwork for better integration of emotion analysis in applications using ChatGPT.
Submitted 4 December, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs) such as GPT-3 and ChatGPT. The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech. Our motivation is to explore ways of utilizing the recent progress in LLMs for practical robotic applications, which benefits the development of both chatbots and LLMs. Specifically, it enables the development of highly responsive chatbot systems by leveraging LLMs and adds visual effects to the user interface of LLMs as an additional value. The source code for the system is available on GitHub for our in-house robot (https://github.com/microsoft/LabanotationSuite/tree/master/MSRAbotChatSimulation) and GitHub for Toyota HSR (https://github.com/microsoft/GPT-Enabled-HSR-CoSpeechGestures).
Submitted 10 May, 2023;
originally announced June 2023.
-
Applying Learning-from-observation to household service robots: three common-sense formulation
Authors:
Katsushi Ikeuchi,
Jun Takamatsu,
Kazuhiro Sasabuchi,
Naoki Wake,
Atsushi Kanehiro
Abstract:
Utilizing a robot in a new application requires the robot to be programmed each time. To reduce such programming effort, we have been developing "Learning-from-observation (LfO)", which automatically generates robot programs by observing human demonstrations. One of the main issues with introducing this LfO system into the domain of household tasks is the cluttered environments, which cause difficulty in determining which elements are important for task execution when observing demonstrations. To overcome this issue, it is necessary for the system to have common sense shared with the human demonstrator. This paper addresses three relationships that LfO in the household domain should focus on when observing demonstrations and proposes representations to describe the common sense used by the demonstrator for optimal execution of task sequences. Specifically, the paper proposes to use labanotation to describe the postures between the environment and the robot, contact-webs to describe the grasping methods between the robot and the tool, and physical and semantic constraints to describe the motions between the tool and the environment. Then, based on these representations, the paper formulates task models, machine-independent robot programs that indicate what to do and how to do it. Next, the paper explains the task encoder to obtain task models and the task decoder to execute the task models on the robot hardware. Finally, this paper presents how the system actually works through several example scenes.
Submitted 19 April, 2023;
originally announced April 2023.
-
Efficiently Collecting Training Dataset for 2D Object Detection by Online Visual Feedback
Authors:
Takuya Kiyokawa,
Naoki Shirakura,
Hiroki Katayama,
Keita Tomochika,
Jun Takamatsu
Abstract:
Training deep-learning-based vision systems requires the manual annotation of a significant number of images. Such manual annotation is highly time-consuming and labor-intensive. Although previous studies have attempted to eliminate the effort required for annotation, the effort required for image collection was retained. To address this, we propose a human-in-the-loop dataset collection method that uses a web application. To balance workload against performance, we encourage the collection of multi-view object image datasets in an enjoyable manner that increases motivation, and we propose three types of online visual feedback features to track the progress of the collection status. Our experiments thoroughly investigated the impact of each feature on collection performance and quality of operation. The results suggested the feasibility of annotation and object detection.
Submitted 6 November, 2024; v1 submitted 10 April, 2023;
originally announced April 2023.
-
ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. The paper proposes easy-to-customize input prompts for ChatGPT that meet common requirements in practical applications, such as easy integration with robot execution systems and applicability to various environments while minimizing the impact of ChatGPT's token limit. The prompts encourage ChatGPT to output a sequence of predefined robot actions, represent the operating environment in a formalized style, and infer the updated state of the operating environment. Experiments confirmed that the proposed prompts enable ChatGPT to act according to requirements in various environments, and users can adjust ChatGPT's output with natural language feedback for safe and robust operation. The proposed prompts and source code are open-source and publicly available at https://github.com/microsoft/ChatGPT-Robot-Manipulation-Prompts
Submitted 29 August, 2023; v1 submitted 7 April, 2023;
originally announced April 2023.
-
Task-sequencing Simulator: Integrated Machine Learning to Execution Simulation for Robot Manipulation
Authors:
Kazuhiro Sasabuchi,
Daichi Saito,
Atsushi Kanehira,
Naoki Wake,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
A task-sequencing simulator in robotics manipulation to integrate simulation-for-learning and simulation-for-execution is introduced. Unlike existing machine-learning simulation where a non-decomposed simulation is used to simulate a training scenario, the task-sequencing simulator runs a composed simulation using building blocks. This way, the simulation-for-learning is structured similarly to a multi-step simulation-for-execution. To compose both learning and execution scenarios, a unified trainable-and-composable description of blocks called a concept model is proposed and used. Using the simulator design and concept models, a reusable simulator for learning different tasks, a common-ground system for learning-to-execution, and simulation-to-real are achieved and shown.
Submitted 3 January, 2023;
originally announced January 2023.
-
Interactive Task Encoding System for Learning-from-Observation
Authors:
Naoki Wake,
Atsushi Kanehira,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
We present the Interactive Task Encoding System (ITES) for teaching robots to perform manipulative tasks. ITES is designed as an input system for the Learning-from-Observation (LfO) framework, which enables household robots to be programmed using few-shot human demonstrations without the need for coding. In contrast to previous LfO systems that rely solely on visual demonstrations, ITES leverages both verbal instructions and interaction to enhance recognition robustness, thus enabling multimodal LfO. ITES identifies tasks from verbal instructions and extracts parameters from visual demonstrations. Meanwhile, the recognition result is reviewed by the user for interactive correction. Evaluations conducted on a real robot demonstrate the successful teaching of multiple operations for several scenarios, suggesting the usefulness of ITES for multimodal LfO. The source code is available at https://github.com/microsoft/symbolic-robot-teaching-interface.
Submitted 28 April, 2023; v1 submitted 21 December, 2022;
originally announced December 2022.
-
Learning-from-Observation System Considering Hardware-Level Reusability
Authors:
Jun Takamatsu,
Kazuhiro Sasabuchi,
Naoki Wake,
Atsushi Kanehira,
Katsushi Ikeuchi
Abstract:
Robot developers develop various types of robots to satisfy users' various demands. Users' demands are related to their backgrounds, and the robots suitable for users may vary. If a developer offers a user a robot that is different from the usual one, the robot-specific software has to be changed. On the other hand, robot-software developers would like to reuse their developed software as much as possible to reduce their effort. We propose a system design that considers hardware-level reusability. For this purpose, we begin with the learning-from-observation framework. This framework represents a target task in a robot-agnostic representation, and thus the represented task description can be shared among various robots. When executing the task, it is necessary to convert the robot-agnostic description into commands for a target robot. To increase the reusability, first, we implement the skill library, a set of robot motion primitives, considering only the robot hand, and we regard the robot as just a carrier that moves the hand along the target trajectory. The skill library is reusable if we would like to use the same robot hand. Second, we employ a generic IK solver to quickly swap robots. We verify the hardware-level reusability by applying two task descriptions to two different robots, Nextage and Fetch.
Submitted 18 December, 2022;
originally announced December 2022.
-
Design strategies for controlling neuron-connected robots using reinforcement learning
Authors:
Haruto Sawada,
Naoki Wake,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Hirokazu Takahashi,
Katsushi Ikeuchi
Abstract:
Despite the growing interest in robot control utilizing the computation of biological neurons, context-dependent behavior by neuron-connected robots remains a challenge. Context-dependent behavior here is defined as behavior that is not the result of a simple sensory-motor coupling, but rather based on an understanding of the task goal. This paper proposes design principles for training neuron-connected robots based on task goals to achieve context-dependent behavior. First, we employ deep reinforcement learning (RL) to enable training that accounts for goal achievements. Second, we propose a neuron simulator as a probability distribution based on recorded neural data, aiming to represent physiologically valid neural dynamics while avoiding complex modeling with high computational costs. Furthermore, we propose to update the simulators during the training to bridge the gap between the simulation and the real settings. The experiments showed that the robot gradually learned context-dependent behaviors in pole balancing and robot navigation tasks. Moreover, the learned policies were valid for neural simulators based on novel neural data, and the task performance increased by updating the simulators during training. These results suggest the effectiveness of the proposed design principle for the context-dependent behavior of neuron-connected robots.
Submitted 29 March, 2022;
originally announced March 2022.
-
Task-grasping from human demonstration
Authors:
Daichi Saito,
Kazuhiro Sasabuchi,
Naoki Wake,
Jun Takamatsu,
Hideki Koike,
Katsushi Ikeuchi
Abstract:
A challenge in robot grasping is to achieve task-grasping, which is to select a grasp that is advantageous to the success of tasks before and after the grasp. One of the frameworks to address this difficulty is Learning-from-Observation (LfO), which obtains various hints from human demonstrations. This paper solves three issues in the grasping skills in the LfO framework: 1) how to functionally mimic human-demonstrated grasps on robots with limited grasp capability, 2) how to coordinate grasp skills with reaching body mimicking, and 3) how to robustly perform grasps under object pose and shape uncertainty. A deep reinforcement learning method using contact-web-based rewards and domain randomization of approach directions is proposed to achieve such robust mimicked grasping skills. Experimental results show that the trained grasping skills can be applied in an LfO system and executed on a real robot. In addition, it is shown that the trained skill is robust to errors in the object pose and to the uncertainty of the object shape and can be combined with various reach-coordination.
Submitted 1 March, 2022;
originally announced March 2022.
-
Active Vapor-Based Robotic Wiper
Authors:
Takuya Kiyokawa,
Hiroki Katayama,
Jun Takamatsu,
Kensuke Harada
Abstract:
This paper presents a method for estimating normals of mirrors and transparent objects challenging for cameras to recognize. We propose spraying water vapor onto mirror or transparent surfaces to create a diffuse reflective surface. Using an ultrasonic humidifier on a robotic arm, we apply water vapor to the target object's surface, forming a cross-shaped misted area. This creates partially diffuse reflective surfaces, enabling the camera to detect the target object's surface. Adjusting the gripper-mounted camera viewpoint maximizes the extracted misted area's appearance in the image, allowing normal estimation of the target surface. Experiments show the method's effectiveness, with RMSEs of azimuth estimation for mirrors and transparent glass at approximately 4.2 and 5.8 degrees, respectively. Our robot experiments demonstrated that our robotic wiper can perform contact-force-regulated wiping motions to clean a transparent window, akin to human performance.
Submitted 6 November, 2024; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Soft-Jig: A Flexible Sensing Jig for Simultaneously Fixing and Estimating Orientation of Assembly Parts
Authors:
Tatsuya Sakuma,
Takuya Kiyokawa,
Jun Takamatsu,
Takahiro Wada,
Tsukasa Ogasawara
Abstract:
For assembly tasks, it is essential to firmly fix target parts and to accurately estimate their poses. Several rigid jigs for individual parts are frequently used in assembly factories to achieve precise and time-efficient product assembly. However, providing customized jigs is time-consuming. In this study, to address the lack of versatility in the shapes the jigs can be used for, we developed a flexible jig with a soft membrane including transparent beads and oil with a tuned refractive index. The bead-based jamming transition was accomplished by discharging only oil enabling a part to be firmly fixed. Because the two cameras under the jig are able to capture membrane shape changes, we proposed a sensing method to estimate the orientation of the part based on the behaviors of markers created on the jig's inner surface. Through estimation experiments, the proposed system could estimate the orientation of a cylindrical object with a diameter larger than 50 mm and an RMSE of less than 3 degrees.
Submitted 16 September, 2021; v1 submitted 15 September, 2021;
originally announced September 2021.
-
Robotic Waste Sorter with Agile Manipulation and Quickly Trainable Detector
Authors:
Takuya Kiyokawa,
Hiroki Katayama,
Yuya Tatsuta,
Jun Takamatsu,
Tsukasa Ogasawara
Abstract:
Owing to human labor shortages, the automation of labor-intensive manual waste-sorting is needed. The goal of automating waste-sorting is to replace the human role of robust detection and agile manipulation of waste items with robots. To achieve this, we propose three methods. First, we provide a combined manipulation method using graspless push-and-drop and pick-and-release manipulation. Second, we provide a robotic system that can automatically collect object images to quickly train a deep neural-network model. Third, we provide a method to mitigate the differences in the appearance of target objects from two scenes: one for dataset collection and the other for waste sorting in a recycling factory. If differences exist, the performance of a trained waste detector may decrease. We address differences in illumination and background by applying object scaling, histogram matching with histogram equalization, and background synthesis to the source target-object images. Via experiments in an indoor experimental workplace for waste-sorting, we confirm that the proposed methods enable quick collection of the training image sets for three classes of waste items (i.e., aluminum can, glass bottle, and plastic bottle) and detection with higher performance than the methods that do not consider the differences. We also confirm that the proposed method enables the robot to quickly manipulate the objects.
Submitted 4 September, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Semantic constraints to represent common sense required in household actions for multi-modal Learning-from-observation robot
Authors:
Katsushi Ikeuchi,
Naoki Wake,
Riku Arakawa,
Kazuhiro Sasabuchi,
Jun Takamatsu
Abstract:
The paradigm of learning-from-observation (LfO) enables a robot to learn how to perform actions by observing human-demonstrated actions. Previous research in LfO has mainly focused on the industrial domain, which only consists of the observable physical constraints between a manipulating tool and the robot's working environment. In order to extend this paradigm to the household domain, which consists of non-observable constraints derived from a human's common sense, we introduce the idea of semantic constraints. The semantic constraints are represented similarly to the physical constraints by defining a contact with an imaginary semantic environment. We thoroughly investigate the necessary and sufficient set of contact states and state transitions to understand the different types of physical and semantic constraints. We then apply our constraint representation to analyze various actions in top hit household YouTube videos and real home cooking recordings. We further categorize the frequently appearing constraint patterns into physical, semantic, and multistage task groups and verify that these groups are not only a necessary but also a sufficient set for covering standard household actions. Finally, we conduct a preliminary experiment using textual input to explore the possibilities of combining verbal and visual input for recognizing the task groups. Our results provide promising directions for incorporating common sense in the literature of robot teaching.
Submitted 3 March, 2021;
originally announced March 2021.
-
Toward an Affective Touch Robot: Subjective and Physiological Evaluation of Gentle Stroke Motion Using a Human-Imitation Hand
Authors:
Tomoki Ishikura,
Akishige Yuguchi,
Yuki Kitamura,
Sung-Gwi Cho,
Ming Ding,
Jun Takamatsu,
Wataru Sato,
Sakiko Yoshikawa,
Tsukasa Ogasawara
Abstract:
Affective touch offers positive psychological and physiological benefits such as the mitigation of stress and pain. If a robot could realize human-like affective touch, it would open up new application areas, including supporting care work. In this research, we focused on the gentle stroking motion of a robot to evoke the same emotions that human touch would evoke: in other words, an affective touch robot. We propose a robot that is able to gently stroke the back of a human using our designed human-imitation hand. To evaluate the emotional effects of this affective touch, we compared the results of a combination of two agents (the human-imitation hand and the human hand), at two stroke speeds (3 and 30 cm/s). The results of the subjective and physiological evaluations highlighted the following three findings: 1) the subjects evaluated strokes similarly with regard to the stroke speed of the human and human-imitation hand, in both the subjective and physiological evaluations; 2) the subjects felt greater pleasure and arousal at the faster stroke rate (30 cm/s rather than 3 cm/s); and 3) poorer fitting of the human-imitation hand due to the bending of the back had a negative emotional effect on the subjects.
Submitted 8 December, 2020;
originally announced December 2020.
-
Assembly Sequences Based on Multiple Criteria Against Products with Deformable Parts
Authors:
Takuya Kiyokawa,
Jun Takamatsu,
Tsukasa Ogasawara
Abstract:
Aiming to generate easy-to-handle assembly sequences for robotic assembly, this study tackles assembly sequence generation by considering two tradeoff objectives: (1) insertion conditions and (2) degrees of constraints among assembled parts. We propose a multiobjective genetic algorithm to balance these two objectives for generating assembly sequences. Furthermore, the method of extracting part relation matrices including interference-free, insertion, and degree of constraint matrices is extended for application to 3D computer-aided design (CAD) models, including deformable parts. The interference of deformable parts with other parts can be easily investigated by scaling parts. A simulation experiment was conducted using the proposed method, and the results show the possibility of obtaining Pareto-optimal solutions of assembly sequences for a 3D CAD model with 33 parts including a deformable part. This approach can potentially be extended to handle various types of deformable parts and to explore graspable sequences during assembly operations.
Submitted 2 April, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
Soft-Jig-Driven Assembly Operations
Authors:
Takuya Kiyokawa,
Tatsuya Sakuma,
Jun Takamatsu,
Tsukasa Ogasawara
Abstract:
To design a general-purpose assembly robot system that can handle objects of various shapes, we propose a soft jig that fits to the shapes of assembly parts. The functionality of the soft jig is based on a jamming gripper developed in the field of soft robotics. The soft jig has a bag covered with a malleable silicone membrane, which has high friction, elongation, and contraction rates for keeping parts fixed. The bag is filled with glass beads to achieve a jamming transition. We propose a method to configure parts-fixing on the soft jig based on contact relations, reachable directions, and the center of gravity of the parts that are fixed on the jig. The usability of the soft jig was evaluated in terms of the fixing performance and versatility for various shapes and postures of parts.
Submitted 24 March, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
A Learning-from-Observation Framework: One-Shot Robot Teaching for Grasp-Manipulation-Release Household Operations
Authors:
Naoki Wake,
Riku Arakawa,
Iori Yanokura,
Takuya Kiyokawa,
Kazuhiro Sasabuchi,
Jun Takamatsu,
Katsushi Ikeuchi
Abstract:
A household robot is expected to perform various manipulative operations with an understanding of the purpose of the task. To this end, a desirable robotic application should provide an on-site robot teaching framework for non-experts. Here we propose a Learning-from-Observation (LfO) framework for grasp-manipulation-release class household operations (GMR-operations). The framework maps human demonstrations to predefined task models through one-shot teaching. Each task model contains both high-level knowledge regarding the geometric constraints and low-level knowledge related to human postures. The key idea is to design a task model that 1) covers various GMR-operations and 2) includes human postures to achieve tasks. We verify the applicability of our framework by testing an operational LfO system with a real robot. In addition, we quantify the coverage of the task model by analyzing online videos of household operations. In the context of one-shot robot teaching, the contribution of this study is a framework that 1) covers various GMR-operations and 2) mimics human postures during the operations.
Submitted 20 October, 2020; v1 submitted 4 August, 2020;
originally announced August 2020.
-
Control of Walking Assist Exoskeleton with Time-delay Based on the Prediction of Plantar Force
Authors:
Ming Ding,
Mikihisa Nagashima,
Sung-Gwi Cho,
Jun Takamatsu,
Tsukasa Ogasawara
Abstract:
Many kinds of lower-limb exoskeletons have been developed for walking assistance. However, when controlling these exoskeletons, time-delay due to the computation time and the communication delays is still a general problem. In this research, we propose a novel method to prevent the time-delay when controlling a walking assist exoskeleton by predicting the future plantar force and walking status. By using Long Short-Term Memory and a fully-connected network, the plantar force can be predicted using only data measured by inertial measurement unit sensors, not only during the walking period but also at the start and end of walking. From the predicted plantar force, the walking status and the desired assistance timing can also be determined. By considering the time-delay and sending the control commands beforehand, the exoskeleton can be moved precisely at the desired assistance timing. In experiments, the prediction accuracy of the plantar force and the assistance timing is confirmed. The performance of the proposed method is also evaluated by using the trained model to control the exoskeleton.
Submitted 17 July, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Multi-View Inpainting for RGB-D Sequence
Authors:
Feiran Li,
Gustavo Alfonso Garcia Ricardez,
Jun Takamatsu,
Tsukasa Ogasawara
Abstract:
In this work, we propose a novel approach to remove undesired objects from RGB-D sequences captured with freely moving cameras, which enables static 3D reconstruction. Our method jointly uses existing information from multiple frames and generates new information via inpainting techniques. We use balanced rules to select source frames, a local-homography-based image warping method for alignment, and a Markov random field (MRF) based approach for combining existing information. For the remaining holes, we employ an exemplar-based multi-view inpainting method to deal with the color image and coherently use it as guidance to complete the depth correspondence. Experiments show that our approach is capable of removing the undesired objects and inpainting the holes.
Submitted 21 November, 2018;
originally announced November 2018.