-
Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
Authors:
Jeremy Siburian,
Keisuke Shirai,
Cristian C. Beltran-Hernandez,
Masashi Hamaya,
Michael Görner,
Atsushi Hashimoto
Abstract:
While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper pro…
▽ More
While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) ViLaIn (Vision-Language Interpreter) - A prior framework that converts multimodal inputs into structured problem specifications using off-the-shelf VLMs without additional domain-specific training, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning and can utilize learning-based skills for key manipulation phases, and (3) a corrective planning module which receives concrete feedback on failed solution attempts from the motion and task planning components and can feed adapted logic and geometric feasibility constraints back to ViLaIn to improve and further refine the specification. We evaluate our framework on several challenging manipulation tasks in a cooking domain. We demonstrate that the proposed closed-loop corrective architecture exhibits a more than 30% higher mean success rate for ViLaIn-TAMP compared to without corrective planning.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks
Authors:
Edgar Anarossi,
Yuhwan Kwon,
Hirotaka Tahara,
Shohei Tanaka,
Keisuke Shirai,
Masashi Hamaya,
Cristian C. Beltran-Hernandez,
Atsushi Hashimoto,
Takamitsu Matsubara
Abstract:
Dynamic Movement Primitives (DMPs) provide a flexible framework wherein smooth robotic motions are encoded into modular parameters. However, they face challenges in integrating multimodal inputs commonly used in robotics like vision and language into their framework. To fully maximize DMPs' potential, enabling them to handle multimodal inputs is essential. In addition, we also aim to extend DMPs'…
▽ More
Dynamic Movement Primitives (DMPs) provide a flexible framework wherein smooth robotic motions are encoded into modular parameters. However, they face challenges in integrating multimodal inputs commonly used in robotics like vision and language into their framework. To fully maximize DMPs' potential, enabling them to handle multimodal inputs is essential. In addition, we also aim to extend DMPs' capability to handle object-focused tasks requiring one-shot complex motion generation, as observation occlusion could easily happen mid-execution in such tasks (e.g., knife occlusion in cake icing, hand occlusion in dough kneading, etc.). A promising approach is to leverage Vision-Language Models (VLMs), which process multimodal data and can grasp high-level concepts. However, they typically lack enough knowledge and capabilities to directly infer low-level motion details and instead only serve as a bridge between high-level instructions and low-level control. To address this limitation, we propose Keyword Labeled Primitive Selection and Keypoint Pairs Generation Guided Movement Primitives (KeyMPs), a framework that combines VLMs with sequencing of DMPs. KeyMPs use VLMs' high-level reasoning capability to select a reference primitive through keyword labeled primitive selection and VLMs' spatial awareness to generate spatial scaling parameters used for sequencing DMPs by generalizing the overall motion through keypoint pairs generation, which together enable one-shot vision-language guided motion generation that aligns with the intent expressed in the multimodal input. We validate our approach through an occlusion-rich manipulation task, specifically object cutting experiments in both simulated and real-world environments, demonstrating superior performance over other DMP-based methods that integrate VLMs support.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos Referring to Procedural Texts
Authors:
Yuto Haneji,
Taichi Nishimura,
Hirotaka Kameko,
Keisuke Shirai,
Tomoya Yoshida,
Keiya Kajimura,
Koki Yamamoto,
Taiyu Cui,
Tomohiro Nishimoto,
Shinsuke Mori
Abstract:
Mistake action detection is crucial for developing intelligent archives that detect workers' errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Addition…
▽ More
Mistake action detection is crucial for developing intelligent archives that detect workers' errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection. Data is available through https://y-haneji.github.io/EgoOops-project-page/.
△ Less
Submitted 11 February, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes
Authors:
Tomohiro Nishimoto,
Taichi Nishimura,
Koki Yamamoto,
Keisuke Shirai,
Hirotaka Kameko,
Yuto Haneji,
Tomoya Yoshida,
Keiya Kajimura,
Taiyu Cui,
Chihiro Nishiwaki,
Eriko Daikoku,
Natsuko Okuda,
Fumihito Ono,
Shinsuke Mori
Abstract:
This paper introduces BioVL-QR, a biochemical vision-and-language dataset comprising 23 egocentric experiment videos, corresponding protocols, and vision-and-language alignments. A major challenge in understanding biochemical videos is detecting equipment, reagents, and containers because of the cluttered environment and indistinguishable objects. Previous studies assumed manual object annotation,…
▽ More
This paper introduces BioVL-QR, a biochemical vision-and-language dataset comprising 23 egocentric experiment videos, corresponding protocols, and vision-and-language alignments. A major challenge in understanding biochemical videos is detecting equipment, reagents, and containers because of the cluttered environment and indistinguishable objects. Previous studies assumed manual object annotation, which is costly and time-consuming. To address the issue, we focus on Micro QR Codes. However, detecting objects using only Micro QR Codes is still difficult due to blur and occlusion caused by object manipulation. To overcome this, we propose an object labeling method combining a Micro QR Code detector with an off-the-shelf hand object detector. As an application of the method and BioVL-QR, we tackled the task of localizing the procedural steps in an instructional video. The experimental results show that using Micro QR Codes and our method improves biochemical video understanding. Data and code are available through https://nishi10mo.github.io/BioVL-QR/
△ Less
Submitted 29 May, 2025; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Authors:
Keyaki Ohno,
Hirotaka Kameko,
Keisuke Shirai,
Taichi Nishimura,
Shinsuke Mori
Abstract:
Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from the coverage of location expr…
▽ More
Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from the coverage of location expressions on general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, location expressions of the article title and those hyperlinks to other articles are assigned with coordinates. By utilizing hyperlinks, we can accurately assign location expressions with coordinates even with ambiguous location expressions in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
Discovering Highly Influential Shortcut Reasoning: An Automated Template-Free Approach
Authors:
Daichi Haraguchi,
Kiyoaki Shirai,
Naoya Inoue,
Natthawut Kertkeidkachorn
Abstract:
Shortcut reasoning is an irrational process of inference, which degrades the robustness of an NLP model. While a number of previous work has tackled the identification of shortcut reasoning, there are still two major limitations: (i) a method for quantifying the severity of the discovered shortcut reasoning is not provided; (ii) certain types of shortcut reasoning may be missed. To address these i…
▽ More
Shortcut reasoning is an irrational process of inference, which degrades the robustness of an NLP model. While a number of previous work has tackled the identification of shortcut reasoning, there are still two major limitations: (i) a method for quantifying the severity of the discovered shortcut reasoning is not provided; (ii) certain types of shortcut reasoning may be missed. To address these issues, we propose a novel method for identifying shortcut reasoning. The proposed method quantifies the severity of the shortcut reasoning by leveraging out-of-distribution data and does not make any assumptions about the type of tokens triggering the shortcut reasoning. Our experiments on Natural Language Inference and Sentiment Analysis demonstrate that our framework successfully discovers known and unknown shortcut reasoning in the previous work.
△ Less
Submitted 15 December, 2023;
originally announced December 2023.
-
Vision-Language Interpreter for Robot Task Planning
Authors:
Keisuke Shirai,
Cristian C. Beltran-Hernandez,
Masashi Hamaya,
Atsushi Hashimoto,
Shohei Tanaka,
Kento Kawaharazuka,
Kazutoshi Tanaka,
Yoshitaka Ushiku,
Shinsuke Mori
Abstract:
Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By gener…
▽ More
Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from language instruction and scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLM and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99\% accuracy and valid plans with more than 58\% accuracy. Our code and dataset are available at https://github.com/omron-sinicx/ViLaIn.
△ Less
Submitted 19 February, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Towards Flow Graph Prediction of Open-Domain Procedural Texts
Authors:
Keisuke Shirai,
Hirotaka Kameko,
Shinsuke Mori
Abstract:
Machine comprehension of procedural texts is essential for reasoning about the steps and automating the procedures. However, this requires identifying entities within a text and resolving the relationships between the entities. Previous work focused on the cooking domain and proposed a framework to convert a recipe text into a flow graph (FG) representation. In this work, we propose a framework ba…
▽ More
Machine comprehension of procedural texts is essential for reasoning about the steps and automating the procedures. However, this requires identifying entities within a text and resolving the relationships between the entities. Previous work focused on the cooking domain and proposed a framework to convert a recipe text into a flow graph (FG) representation. In this work, we propose a framework based on the recipe FG for flow graph prediction of open-domain procedural texts. To investigate flow graph prediction performance in non-cooking domains, we introduce the wikiHow-FG corpus from articles on wikiHow, a website of how-to instruction articles. In experiments, we consider using the existing recipe corpus and performing domain adaptation from the cooking to the target domain. Experimental results show that the domain adaptation models achieve higher performance than those trained only on the cooking or target domain data.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
Authors:
Keisuke Shirai,
Atsushi Hashimoto,
Taichi Nishimura,
Hirotaka Kameko,
Shuhei Kurita,
Yoshitaka Ushiku,
Shinsuke Mori
Abstract:
We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn each cooking action result in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-mo…
▽ More
We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn each cooking action result in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning and procedural text generation.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
Guided Facial Skin Color Correction
Authors:
Keiichiro Shirai,
Tatsuya Baba,
Shunsuke Ono,
Masahiro Okuda,
Yusuke Tatesumi,
Paul Perrotin
Abstract:
This paper proposes an automatic image correction method for portrait photographs, which promotes consistency of facial skin color by suppressing skin color changes due to background colors. In portrait photographs, skin color is often distorted due to the lighting environment (e.g., light reflected from a colored background wall and over-exposure by a camera strobe), and if the photo is artificia…
▽ More
This paper proposes an automatic image correction method for portrait photographs, which promotes consistency of facial skin color by suppressing skin color changes due to background colors. In portrait photographs, skin color is often distorted due to the lighting environment (e.g., light reflected from a colored background wall and over-exposure by a camera strobe), and if the photo is artificially combined with another background color, this color change is emphasized, resulting in an unnatural synthesized result. In our framework, after roughly extracting the face region and rectifying the skin color distribution in a color space, we perform color and brightness correction around the face in the original image to achieve a proper color balance of the facial image, which is not affected by luminance and background colors. Unlike conventional algorithms for color correction, our final result is attained by a color correction process with a guide image. In particular, our guided image filtering for the color correction does not require a perfectly-aligned guide image required in the original guide image filtering method proposed by He et al. Experimental results show that our method generates more natural results than conventional methods on not only headshot photographs but also natural scene photographs. We also show automatic yearbook style photo generation as an another application.
△ Less
Submitted 19 May, 2021;
originally announced May 2021.
-
Neural Text Generation with Artificial Negative Examples
Authors:
Keisuke Shirai,
Kazuma Hashimoto,
Akiko Eriguchi,
Takashi Ninomiya,
Shinsuke Mori
Abstract:
Neural text generation models conditioning on given input (e.g. machine translation and image captioning) are usually trained by maximum likelihood estimation of target text. However, the trained models suffer from various types of errors at inference time. In this paper, we propose to suppress an arbitrary type of errors by training the text generation model in a reinforcement learning framework,…
▽ More
Neural text generation models conditioning on given input (e.g. machine translation and image captioning) are usually trained by maximum likelihood estimation of target text. However, the trained models suffer from various types of errors at inference time. In this paper, we propose to suppress an arbitrary type of errors by training the text generation model in a reinforcement learning framework, where we use a trainable reward function that is capable of discriminating between references and sentences containing the targeted type of errors. We create such negative examples by artificially injecting the targeted errors to the references. In experiments, we focus on two error types, repeated and dropped tokens in model-generated text. The experimental results show that our method can suppress the generation errors and achieve significant improvements on two machine translation and two image captioning tasks.
△ Less
Submitted 28 December, 2020;
originally announced December 2020.
-
Fast Singular Value Shrinkage with Chebyshev Polynomial Approximation Based on Signal Sparsity
Authors:
Masaki Onuki,
Shunsuke Ono,
Keiichiro Shirai,
Yuichi Tanaka
Abstract:
We propose an approximation method for thresholding of singular values using Chebyshev polynomial approximation (CPA). Many signal processing problems require iterative application of singular value decomposition (SVD) for minimizing the rank of a given data matrix with other cost functions and/or constraints, which is called matrix rank minimization. In matrix rank minimization, singular values o…
▽ More
We propose an approximation method for thresholding of singular values using Chebyshev polynomial approximation (CPA). Many signal processing problems require iterative application of singular value decomposition (SVD) for minimizing the rank of a given data matrix with other cost functions and/or constraints, which is called matrix rank minimization. In matrix rank minimization, singular values of a matrix are shrunk by hard-thresholding, soft-thresholding, or weighted soft-thresholding. However, the computational cost of SVD is generally too expensive to handle high dimensional signals such as images; hence, in this case, matrix rank minimization requires enormous computation time. In this paper, we leverage CPA to (approximately) manipulate singular values without computing singular values and vectors. The thresholding of singular values is expressed by a multiplication of certain matrices, which is derived from a characteristic of CPA. The multiplication is also efficiently computed using the sparsity of signals. As a result, the computational cost is significantly reduced. Experimental results suggest the effectiveness of our method through several image processing applications based on matrix rank minimization with nuclear norm relaxation in terms of computation time and approximation precision.
△ Less
Submitted 19 May, 2017;
originally announced May 2017.
-
Fast Supervised Discrete Hashing and its Analysis
Authors:
Gou Koutaki,
Keiichiro Shirai,
Mitsuru Ambai
Abstract:
In this paper, we propose a learning-based supervised discrete hashing method. Binary hashing is widely used for large-scale image retrieval as well as video and document searches because the compact representation of binary code is essential for data storage and reasonable for query searches using bit-operations. The recently proposed Supervised Discrete Hashing (SDH) efficiently solves mixed-int…
▽ More
In this paper, we propose a learning-based supervised discrete hashing method. Binary hashing is widely used for large-scale image retrieval as well as video and document searches because the compact representation of binary code is essential for data storage and reasonable for query searches using bit-operations. The recently proposed Supervised Discrete Hashing (SDH) efficiently solves mixed-integer programming problems by alternating optimization and the Discrete Cyclic Coordinate descent (DCC) method. We show that the SDH model can be simplified without performance degradation based on some preliminary experiments; we call the approximate model for this the "Fast SDH" (FSDH) model. We analyze the FSDH model and provide a mathematically exact solution for it. In contrast to SDH, our model does not require an alternating optimization algorithm and does not depend on initial values. FSDH is also easier to implement than Iterative Quantization (ITQ). Experimental results involving a large-scale database showed that FSDH outperforms conventional SDH in terms of precision, recall, and computation time.
△ Less
Submitted 30 November, 2016;
originally announced November 2016.