Search | arXiv e-print repository

Close-Fitting Dressing Assistance Based on State Estimation of Feet and Garments with Semantic-based Visual Attention

Authors: Takuma Tsukakoshi, Tamon Miyake, Tetsuya Ogata, Yushi Wang, Takumi Akaishi, Shigeki Sugano

Abstract: As the population continues to age, a shortage of caregivers is expected in the future. Dressing assistance, in particular, is crucial for opportunities for social participation. Especially dressing close-fitting garments, such as socks, remains challenging due to the need for fine force adjustments to handle the friction or snagging against the skin, while considering the shape and position of th… ▽ More As the population continues to age, a shortage of caregivers is expected in the future. Dressing assistance, in particular, is crucial for opportunities for social participation. Especially dressing close-fitting garments, such as socks, remains challenging due to the need for fine force adjustments to handle the friction or snagging against the skin, while considering the shape and position of the garment. This study introduces a method uses multi-modal information including not only robot's camera images, joint angles, joint torques, but also tactile forces for proper force interaction that can adapt to individual differences in humans. Furthermore, by introducing semantic information based on object concepts, rather than relying solely on RGB data, it can be generalized to unseen feet and background. In addition, incorporating depth data helps infer relative spatial relationship between the sock and the foot. To validate its capability for semantic object conceptualization and to ensure safety, training data were collected using a mannequin, and subsequent experiments were conducted with human subjects. In experiments, the robot successfully adapted to previously unseen human feet and was able to put socks on 10 participants, achieving a higher success rate than Action Chunking with Transformer and Diffusion Policy. These results demonstrate that the proposed model can estimate the state of both the garment and the foot, enabling precise dressing assistance for close-fitting garments. △ Less

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2407.13376 [pdf, other]

Dual-arm Motion Generation for Repositioning Care based on Deep Predictive Learning with Somatosensory Attention Mechanism

Authors: Tamon Miyake, Namiko Saito, Tetsuya Ogata, Yushi Wang, Shigeki Sugano

Abstract: Caregiving is a vital role for domestic robots, especially the repositioning care has immense societal value, critically improving the health and quality of life of individuals with limited mobility. However, repositioning task is a challenging area of research, as it requires robots to adapt their motions while interacting flexibly with patients. The task involves several key challenges: (1) appl… ▽ More Caregiving is a vital role for domestic robots, especially the repositioning care has immense societal value, critically improving the health and quality of life of individuals with limited mobility. However, repositioning task is a challenging area of research, as it requires robots to adapt their motions while interacting flexibly with patients. The task involves several key challenges: (1) applying appropriate force to specific target areas; (2) performing multiple actions seamlessly, each requiring different force application policies; and (3) motion adaptation under uncertain positional conditions. To address these, we propose a deep neural network (DNN)-based architecture utilizing proprioceptive and visual attention mechanisms, along with impedance control to regulate the robot's movements. Using the dual-arm humanoid robot Dry-AIREC, the proposed model successfully generated motions to insert the robot's hand between the bed and a mannequin's back without applying excessive force, and it supported the transition from a supine to a lifted-up position. The project page is here: https://sites.google.com/view/caregiving-robot-airec/repositioning △ Less

Submitted 19 June, 2025; v1 submitted 18 July, 2024; originally announced July 2024.

arXiv:2312.12491 [pdf, ps, other]

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, Kurt Keutzer

Abstract: We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as Metaverse, live video streaming, and broadcasting, where high throu… ▽ More We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach that transforms the original sequential denoising into the batching denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid and high throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance(CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. Besides, we introduce a stochastic similarity filter(SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG leads to speeds up to 2.05x higher than the conventional CFG. Combining the proposed strategies and existing mature acceleration tools makes the image-to-image generation achieve up-to 91.07fps on one RTX4090, improving the throughputs of AutoPipline developed by Diffusers over 59.56x. Furthermore, our proposed StreamDiffusion also significantly reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one RTX4090, respectively. △ Less

Submitted 8 July, 2025; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: tech report, the code is available at https://github.com/cumulo-autumn/StreamDiffusion

arXiv:2309.14837 [pdf, other]

Realtime Motion Generation with Active Perception Using Attention Mechanism for Cooking Robot

Authors: Namiko Saito, Mayu Hiramoto, Ayuna Kubo, Kanata Suzuki, Hiroshi Ito, Shigeki Sugano, Tetsuya Ogata

Abstract: To support humans in their daily lives, robots are required to autonomously learn, adapt to objects and environments, and perform the appropriate actions. We tackled on the task of cooking scrambled eggs using real ingredients, in which the robot needs to perceive the states of the egg and adjust stirring movement in real time, while the egg is heated and the state changes continuously. In previou… ▽ More To support humans in their daily lives, robots are required to autonomously learn, adapt to objects and environments, and perform the appropriate actions. We tackled on the task of cooking scrambled eggs using real ingredients, in which the robot needs to perceive the states of the egg and adjust stirring movement in real time, while the egg is heated and the state changes continuously. In previous works, handling changing objects was found to be challenging because sensory information includes dynamical, both important or noisy information, and the modality which should be focused on changes every time, making it difficult to realize both perception and motion generation in real time. We propose a predictive recurrent neural network with an attention mechanism that can weigh the sensor input, distinguishing how important and reliable each modality is, that realize quick and efficient perception and motion generation. The model is trained with learning from the demonstration, and allows the robot to acquire human-like skills. We validated the proposed technique using the robot, Dry-AIREC, and with our learning model, it could perform cooking eggs with unknown ingredients. The robot could change the method of stirring and direction depending on the status of the egg, as in the beginning it stirs in the whole pot, then subsequently, after the egg started being heated, it starts flipping and splitting motion targeting specific areas, although we did not explicitly indicate them. △ Less

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2205.04169 [pdf, other]

doi 10.1109/LRA.2022.3142417

Multi-Fingered In-Hand Manipulation with Various Object Properties Using Graph Convolutional Networks and Distributed Tactile Sensors

Authors: Satoshi Funabashi, Tomoki Isobe, Fei Hongyi, Atsumu Hiramoto, Alexander Schmitz, Shigeki Sugano, Tetsuya Ogata

Abstract: Multi-fingered hands could be used to achieve many dexterous manipulation tasks, similarly to humans, and tactile sensing could enhance the manipulation stability for a variety of objects. However, tactile sensors on multi-fingered hands have a variety of sizes and shapes. Convolutional neural networks (CNN) can be useful for processing tactile information, but the information from multi-fingered… ▽ More Multi-fingered hands could be used to achieve many dexterous manipulation tasks, similarly to humans, and tactile sensing could enhance the manipulation stability for a variety of objects. However, tactile sensors on multi-fingered hands have a variety of sizes and shapes. Convolutional neural networks (CNN) can be useful for processing tactile information, but the information from multi-fingered hands needs an arbitrary pre-processing, as CNNs require a rectangularly shaped input, which may lead to unstable results. Therefore, how to process such complex shaped tactile information and utilize it for achieving manipulation skills is still an open issue. This paper presents a control method based on a graph convolutional network (GCN) which extracts geodesical features from the tactile data with complicated sensor alignments. Moreover, object property labels are provided to the GCN to adjust in-hand manipulation motions. Distributed tri-axial tactile sensors are mounted on the fingertips, finger phalanges and palm of an Allegro hand, resulting in 1152 tactile measurements. Training data is collected with a data-glove to transfer human dexterous manipulation directly to the robot hand. The GCN achieved high success rates for in-hand manipulation. We also confirmed that fragile objects were deformed less when correct object labels were provided to the GCN. When visualizing the activation of the GCN with a PCA, we verified that the network acquired geodesical features. Our method achieved stable manipulation even when an experimenter pulled a grasped object and for untrained objects. △ Less

Submitted 9 May, 2022; originally announced May 2022.

Comments: Accepted at ICRA 2022 and RA-Letter

arXiv:2106.02445 [pdf, ps, other]

doi 10.1109/LRA.2021.3062004

How to select and use tools? : Active Perception of Target Objects Using Multimodal Deep Learning

Authors: Namiko Saito, Tetsuya Ogata, Satoshi Funabashi, Hiroki Mori, Shigeki Sugano

Abstract: Selection of appropriate tools and use of them when performing daily tasks is a critical function for introducing robots for domestic applications. In previous studies, however, adaptability to target objects was limited, making it difficult to accordingly change tools and adjust actions. To manipulate various objects with tools, robots must both understand tool functions and recognize object char… ▽ More Selection of appropriate tools and use of them when performing daily tasks is a critical function for introducing robots for domestic applications. In previous studies, however, adaptability to target objects was limited, making it difficult to accordingly change tools and adjust actions. To manipulate various objects with tools, robots must both understand tool functions and recognize object characteristics to discern a tool-object-action relation. We focus on active perception using multimodal sensorimotor data while a robot interacts with objects, and allow the robot to recognize their extrinsic and intrinsic characteristics. We construct a deep neural networks (DNN) model that learns to recognize object characteristics, acquires tool-object-action relations, and generates motions for tool selection and handling. As an example tool-use situation, the robot performs an ingredients transfer task, using a turner or ladle to transfer an ingredient from a pot to a bowl. The results confirm that the robot recognizes object characteristics and servings even when the target ingredients are unknown. We also examine the contributions of images, force, and tactile data and show that learning a variety of multimodal information results in rich perception for tool use. △ Less

Submitted 4 June, 2021; originally announced June 2021.

Comments: Best Paper Award of Cognitive Robotics in ICRA2021 IEEE Robotics and Automation Letters 2021, Proceedings of the 2021 International Conference on Robotics and Automation (ICRA 2021), 2021

Journal ref: IEEE Robotics and Automation Letters 2021

arXiv:1809.11100 [pdf, other]

Rethinking Self-driving: Multi-task Knowledge for Better Generalization and Accident Explanation Ability

Authors: Zhihao Li, Toshiyuki Motoyoshi, Kazuma Sasaki, Tetsuya Ogata, Shigeki Sugano

Abstract: Current end-to-end deep learning driving models have two problems: (1) Poor generalization ability of unobserved driving environment when diversity of training driving dataset is limited (2) Lack of accident explanation ability when driving models don't work as expected. To tackle these two problems, rooted on the believe that knowledge of associated easy task is benificial for addressing difficul… ▽ More Current end-to-end deep learning driving models have two problems: (1) Poor generalization ability of unobserved driving environment when diversity of training driving dataset is limited (2) Lack of accident explanation ability when driving models don't work as expected. To tackle these two problems, rooted on the believe that knowledge of associated easy task is benificial for addressing difficult task, we proposed a new driving model which is composed of perception module for \textit{see and think} and driving module for \textit{behave}, and trained it with multi-task perception-related basic knowledge and driving knowledge stepwisely. Specifically segmentation map and depth map (pixel level understanding of images) were considered as \textit{what \& where} and \textit{how far} knowledge for tackling easier driving-related perception problems before generating final control commands for difficult driving task. The results of experiments demonstrated the effectiveness of multi-task perception knowledge for better generalization and accident explanation ability. With our method the average sucess rate of finishing most difficult navigation tasks in untrained city of CoRL test surpassed current benchmark method for 15 percent in trained weather and 20 percent in untrained weathers. Demonstration video link is: https://www.youtube.com/watch?v=N7ePnnZZwdE △ Less

Submitted 28 September, 2018; originally announced September 2018.

Comments: 11 pages, 4 figures, 3 tables

arXiv:1809.08613 [pdf, ps, other]

Detecting Features of Tools, Objects, and Actions from Effects in a Robot using Deep Learning

Authors: Namiko Saito, Kitae Kim, Shingo Murata, Tetsuya Ogata, Shigeki Sugano

Abstract: We propose a tool-use model that can detect the features of tools, target objects, and actions from the provided effects of object manipulation. We construct a model that enables robots to manipulate objects with tools, using infant learning as a concept. To realize this, we train sensory-motor data recorded during a tool-use task performed by a robot with deep learning. Experiments include four f… ▽ More We propose a tool-use model that can detect the features of tools, target objects, and actions from the provided effects of object manipulation. We construct a model that enables robots to manipulate objects with tools, using infant learning as a concept. To realize this, we train sensory-motor data recorded during a tool-use task performed by a robot with deep learning. Experiments include four factors: (1) tools, (2) objects, (3) actions, and (4) effects, which the model considers simultaneously. For evaluation, the robot generates predicted images and motions given information of the effects of using unknown tools and objects. We confirm that the robot is capable of detecting features of tools, objects, and actions by learning the effects and executing the task. △ Less

Submitted 23 September, 2018; originally announced September 2018.

Comments: 7 pages, 9 figures

Showing 1–8 of 8 results for author: Sugano, S