-
Weakly-Supervised Domain Adaptation with Proportion-Constrained Pseudo-Labeling
Authors:
Takumi Okuo,
Shinnosuke Matsuo,
Shota Harada,
Kiyohito Tanaka,
Ryoma Bise
Abstract:
Domain shift is a significant challenge in machine learning, particularly in medical applications where data distributions differ across institutions due to variations in data collection practices, equipment, and procedures. This can degrade performance when models trained on source domain data are applied to the target domain. Domain adaptation methods have been widely studied to address this issue, but most struggle when class proportions between the source and target domains differ. In this paper, we propose a weakly-supervised domain adaptation method that leverages class proportion information from the target domain, which is often accessible in medical datasets through prior knowledge or statistical reports. Our method assigns pseudo-labels to the unlabeled target data based on class proportion (called proportion-constrained pseudo-labeling), improving performance without the need for additional annotations. Experiments on two endoscopic datasets demonstrate that our method outperforms semi-supervised domain adaptation techniques, even when 5% of the target domain is labeled. Additionally, the experimental results with noisy proportion labels highlight the robustness of our method, further demonstrating its effectiveness in real-world application scenarios.
Submitted 27 June, 2025;
originally announced June 2025.
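The proportion-constrained pseudo-labeling idea in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It is one simple greedy way to make pseudo-label counts match known class proportions. The function name, the greedy confidence-ordered assignment, and the rounding rule are all assumptions made for illustration.

```python
import numpy as np

def proportion_constrained_pseudo_labels(probs, proportions):
    """Greedy sketch (hypothetical helper): fill per-class quotas by confidence.

    probs:       (n_samples, n_classes) predicted class probabilities
    proportions: (n_classes,) known target-domain class proportions, summing to 1
    """
    probs = np.asarray(probs, dtype=float)
    n, k = probs.shape
    # Turn proportions into integer quotas; leftover slots go to the
    # classes with the largest fractional parts.
    raw = np.asarray(proportions, dtype=float) * n
    quota = np.floor(raw).astype(int)
    for c in np.argsort(-(raw - quota))[: n - quota.sum()]:
        quota[c] += 1
    labels = np.full(n, -1)
    # Visit (sample, class) pairs from most to least confident and assign
    # a label whenever the sample is unlabeled and the class has quota left.
    for flat in np.argsort(-probs, axis=None):
        i, c = divmod(int(flat), k)
        if labels[i] == -1 and quota[c] > 0:
            labels[i] = c
            quota[c] -= 1
    return labels

# Plain argmax would label every sample as class 0 here; the 50/50
# proportion constraint forces two samples into class 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(proportion_constrained_pseudo_labels(probs, [0.5, 0.5]))  # [0 0 1 1]
```

The example shows why proportion information helps under class-proportion shift: a classifier biased toward the source-domain majority class is corrected by the quota even though no target sample is labeled.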
-
Learning to assess subjective impressions from speech
Authors:
Yuto Kondo,
Hirokazu Kameoka,
Kou Tanaka,
Takuhiro Kaneko,
Noboru Harada
Abstract:
We tackle a new task of training neural network models that can assess subjective impressions conveyed through speech and assign scores accordingly, inspired by the work on automatic speech quality assessment (SQA). Speech impressions are often described using phrases like 'cute voice.' We define such phrases as subjective voice descriptors (SVDs). Focusing on the difference in usage scenarios between the proposed task and automatic SQA, we design a framework capable of accommodating SVDs personalized to each individual, such as 'my favorite voice.' In this work, we compiled a dataset containing speech labels derived from both absolute category ratings (ACR) and comparison category ratings (CCR). As an evaluation metric for assessment performance, we introduce ppref, the accuracy of the predicted score ordering of two samples on CCR test samples. Alongside the conventional model and learning methods based on ACR data, we also investigated RankNet learning using CCR data. We experimentally find that the ppref is moderate even with very limited training data. We also discover that CCR training is superior to ACR training. These results support the idea that assessment models based on personalized SVDs, which typically must be trained on limited data, can be effectively learned from CCR data.
Submitted 24 June, 2025;
originally announced June 2025.
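The two quantities named in the abstract above, the ppref metric and RankNet learning on comparison data, can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes each CCR item has been reduced to a binary label saying whether sample A was preferred over sample B, and the function names are hypothetical.

```python
import numpy as np

def ranknet_loss(score_a, score_b, a_preferred):
    """RankNet pairwise loss: cross-entropy on sigmoid(score_a - score_b)."""
    diff = np.asarray(score_a, dtype=float) - np.asarray(score_b, dtype=float)
    target = np.asarray(a_preferred, dtype=float)
    # -t*log(sigmoid(d)) - (1-t)*log(1 - sigmoid(d)) simplifies to
    # log(1 + exp(-d)) + (1 - t) * d; logaddexp keeps it numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff) + (1.0 - target) * diff))

def pairwise_order_accuracy(score_a, score_b, a_preferred):
    """Fraction of pairs whose predicted score ordering matches the preference."""
    predicted = np.asarray(score_a, dtype=float) > np.asarray(score_b, dtype=float)
    return float(np.mean(predicted == np.asarray(a_preferred, dtype=bool)))

# Three comparison pairs in which listeners always preferred sample A.
scores_a = [2.0, 1.0, 3.0]
scores_b = [1.0, 2.0, 0.5]
preferred = [1, 1, 1]
print(pairwise_order_accuracy(scores_a, scores_b, preferred))  # 2 of 3 pairs ordered correctly
print(ranknet_loss(scores_a, scores_b, preferred))
```

Because the loss depends only on score differences, a model trained this way needs no absolute rating scale, which is consistent with the abstract's point that comparison data suits personalized descriptors.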
-
Selecting N-lowest scores for training MOS prediction models
Authors:
Yuto Kondo,
Hirokazu Kameoka,
Kou Tanaka,
Takuhiro Kaneko
Abstract:
Automatic speech quality assessment (SQA) has been extensively studied to predict speech quality without time-consuming questionnaires. Recently, neural-based SQA models have been actively developed for speech samples produced by text-to-speech or voice conversion, with a primary focus on training mean opinion score (MOS) prediction models. The quality of each speech sample may not be consistent across the entire duration, and it remains unclear which segments of the speech receive the primary focus from humans when assigning subjective evaluation for MOS calculation. We hypothesize that when humans rate speech, they tend to assign more weight to low-quality speech segments, and the variance in ratings for each sample is mainly due to accidental assignment of higher scores when overlooking the poor-quality speech segments. Motivated by this hypothesis, we analyze the VCC2018 and BVCC datasets and propose a more reliable representative value, N_low-MOS, the mean of the $N$-lowest opinion scores. Our experiments show that LCC and SRCC improve compared to regular MOS when applying N_low-MOS to MOSNet training. This result suggests that N_low-MOS is a more intrinsic representative value of subjective speech quality and makes MOSNet a better comparator of VC models.
Submitted 23 June, 2025;
originally announced June 2025.
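The N_low-MOS defined in the abstract above, the mean of the N lowest opinion scores for a sample, is simple enough to show directly. The function name and the example ratings below are illustrative and are not taken from the paper.

```python
def n_low_mos(scores, n):
    """Mean of the n lowest opinion scores given to one speech sample."""
    if n < 1:
        raise ValueError("n must be at least 1")
    lowest = sorted(scores)[:n]  # uses all scores if fewer than n are available
    return sum(lowest) / len(lowest)

# Five listeners rated the same sample on the usual 1 to 5 scale.
ratings = [5, 4, 4, 2, 3]
print(sum(ratings) / len(ratings))  # regular MOS: 3.6
print(n_low_mos(ratings, 2))        # mean of the two lowest scores: 2.5
```

With n equal to the number of ratings, the value reduces to the regular MOS, so n controls how strongly the target leans toward the most critical listeners.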
-
Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting
Authors:
Yuto Kondo,
Hirokazu Kameoka,
Kou Tanaka,
Takuhiro Kaneko
Abstract:
Speech quality assessment (SQA) aims to evaluate the quality of speech samples without relying on time-consuming listener questionnaires. Recent efforts have focused on training neural-based SQA models to predict the mean opinion score (MOS) of speech samples produced by text-to-speech or voice conversion systems. This paper targets the enhancement of MOS prediction models' performance. We propose a novel score aggregation method to address the limitations of conventional annotations for MOS, which typically involve ratings on a scale from 1 to 5. Our method is based on the hypothesis that annotators internally consider continuous scores and then choose the nearest discrete rating. By modeling this process, we approximate the generative distribution of ratings by quantizing the latent continuous distribution. We then use the peak of this latent distribution, estimated through the loss between the quantized distribution and annotated ratings, as a new representative value instead of MOS. Experimental results demonstrate that substituting MOSNet's predicted target with this proposed value improves prediction performance.
Submitted 23 June, 2025;
originally announced June 2025.
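The aggregation idea in the abstract above, fitting a quantized latent distribution to the discrete ratings and using its peak as the target, can be sketched under strong assumptions. The abstract does not name the latent distribution, the loss, or the optimizer. The sketch below assumes a Gaussian latent score (whose peak is its mean), a negative log-likelihood loss, and a coarse grid search. All function names are hypothetical.

```python
import math

def quantized_gaussian(mu, sigma, levels=(1, 2, 3, 4, 5)):
    """Rating probabilities when a latent N(mu, sigma^2) score is rounded
    to the nearest level (the end levels absorb both tails)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    # Decision boundaries sit halfway between neighbouring levels.
    inner = [cdf((a + b) / 2.0) for a, b in zip(levels, levels[1:])]
    bounds = [0.0] + inner + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

def fit_latent_peak(ratings, levels=(1, 2, 3, 4, 5)):
    """Grid-search the (mu, sigma) that best explains the observed ratings."""
    ratings = list(ratings)
    counts = [ratings.count(level) for level in levels]
    best = None
    for mu_step in range(100, 501):      # mu from 1.00 to 5.00
        mu = mu_step / 100.0
        for sigma_step in range(1, 41):  # sigma from 0.05 to 2.00
            sigma = sigma_step * 0.05
            probs = quantized_gaussian(mu, sigma, levels)
            nll = -sum(c * math.log(max(p, 1e-12)) for c, p in zip(counts, probs))
            if best is None or nll < best[0]:
                best = (nll, mu, sigma)
    return best[1], best[2]

mu, sigma = fit_latent_peak([3, 3, 3, 4, 4, 4])
print(mu, sigma)  # the fitted peak lies midway between the two observed ratings
```

Under these assumptions the fitted peak can differ from the plain mean when ratings pile up at 1 or 5, because the end bins absorb latent scores that fall beyond the scale.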
-
JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
Authors:
Yuto Kondo,
Hirokazu Kameoka,
Kou Tanaka,
Takuhiro Kaneko
Abstract:
We construct the Japanese Idol Speech Corpus (JIS) to advance research in speech generation AI, including text-to-speech synthesis (TTS) and voice conversion (VC). JIS will facilitate more rigorous evaluations of speaker similarity in TTS and VC systems since all speakers in JIS belong to a highly specific category: "young female live idols" in Japan, and each speaker is identified by a stage name, enabling researchers to recruit listeners familiar with these idols for listening experiments. With its unique speaker attributes, JIS will foster compelling research, including generating voices tailored to listener preferences, an area not yet widely studied. JIS will be distributed free of charge to promote research in speech generation AI, with usage restricted to non-commercial, basic research. We describe the construction of JIS, provide an overview of Japanese live idol culture to support effective and ethical use of JIS, and offer a basic analysis to guide application of JIS.
Submitted 23 June, 2025;
originally announced June 2025.
-
MOON: Multi-Objective Optimization-Driven Object-Goal Navigation Using a Variable-Horizon Set-Orienteering Planner
Authors:
Daigo Nakajima,
Kanji Tanaka,
Daiki Iwata,
Kouki Terashima
Abstract:
Object-goal navigation (ON) enables autonomous robots to locate and reach user-specified objects in previously unknown environments, offering promising applications in domains such as assistive care and disaster response. Existing ON methods -- including training-free approaches, reinforcement learning, and zero-shot planners -- generally depend on active exploration to identify landmark objects (e.g., kitchens or desks), followed by navigation toward semantically related targets (e.g., a specific mug). However, these methods often lack strategic planning and do not adequately address trade-offs among multiple objectives. To overcome these challenges, we propose a novel framework that formulates ON as a multi-objective optimization problem (MOO), balancing frontier-based knowledge exploration with knowledge exploitation over previously observed landmarks; we call this framework MOON (MOO-driven ON). We implement a prototype MOON system that integrates three key components: (1) building on QOM [IROS05], a classical ON system that compactly and discriminatively encodes landmarks based on their semantic relevance to the target; (2) integrating StructNav [RSS23], a recently proposed training-free planner, to enhance the navigation pipeline; and (3) introducing a variable-horizon set orienteering problem formulation to enable global optimization over both exploration and exploitation strategies. This work represents an important first step toward developing globally optimized, next-generation object-goal navigation systems.
Submitted 26 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
SCU-Hand: Soft Conical Universal Robotic Hand for Scooping Granular Media from Containers of Various Sizes
Authors:
Tomoya Takahashi,
Cristian C. Beltran-Hernandez,
Yuki Kuroda,
Kazutoshi Tanaka,
Masashi Hamaya,
Yoshitaka Ushiku
Abstract:
Automating small-scale experiments in materials science presents challenges due to the heterogeneous nature of experimental setups. This study introduces the SCU-Hand (Soft Conical Universal Robot Hand), a novel end-effector designed to automate the task of scooping powdered samples from various container sizes using a robotic arm. The SCU-Hand employs a flexible, conical structure that adapts to different container geometries through deformation, maintaining consistent contact without complex force sensing or machine learning-based control methods. Its reconfigurable mechanism allows for size adjustment, enabling efficient scooping from diverse container types. By combining soft robotics principles with a sheet-morphing design, our end-effector achieves high flexibility while retaining the necessary stiffness for effective powder manipulation. We detail the design principles, fabrication process, and experimental validation of the SCU-Hand. Experimental validation showed that the scooping capacity is about 20% higher than that of a commercial tool, with a scooping performance of more than 95% for containers of sizes between 67 mm and 110 mm. This research contributes to laboratory automation by offering a cost-effective, easily implementable solution for automating tasks such as materials synthesis and characterization processes.
Submitted 7 May, 2025;
originally announced May 2025.
-
High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers
Authors:
Brian Wong,
Kaito Tanaka
Abstract:
Automated labeling of chest X-ray reports is essential for enabling downstream tasks such as training image-based diagnostic models, population health studies, and clinical decision support. However, the high variability, complexity, and prevalence of negation and uncertainty in these free-text reports pose significant challenges for traditional Natural Language Processing methods. While large language models (LLMs) demonstrate strong text understanding, their direct application for large-scale, efficient labeling is limited by computational cost and speed. This paper introduces DeBERTa-RAD, a novel two-stage framework that combines the power of state-of-the-art LLM pseudo-labeling with efficient DeBERTa-based knowledge distillation for accurate and fast chest X-ray report labeling. We leverage an advanced LLM to generate high-quality pseudo-labels, including certainty statuses, for a large corpus of reports. Subsequently, a DeBERTa-Base model is trained on this pseudo-labeled data using a tailored knowledge distillation strategy. Evaluated on the expert-annotated MIMIC-500 benchmark, DeBERTa-RAD achieves a state-of-the-art Macro F1 score of 0.9120, significantly outperforming established rule-based systems, fine-tuned transformer models, and direct LLM inference, while maintaining a practical inference speed suitable for high-throughput applications. Our analysis shows particular strength in handling uncertain findings. This work demonstrates a promising path to overcome data annotation bottlenecks and achieve high-performance medical text processing through the strategic combination of LLM capabilities and efficient student models trained via distillation.
Submitted 3 May, 2025;
originally announced May 2025.
-
Formula-Supervised Sound Event Detection: Pre-Training Without Real Data
Authors:
Yuto Shibata,
Keitaro Tanaka,
Yoshiaki Bando,
Keisuke Imoto,
Hirokatsu Kataoka,
Yoshimitsu Aoki
Abstract:
In this paper, we propose a novel formula-driven supervised learning (FDSL) framework for pre-training an environmental sound analysis model by leveraging acoustic signals parametrically synthesized through formula-driven methods. Specifically, we outline detailed procedures and evaluate their effectiveness for sound event detection (SED). The SED task, which involves estimating the types and timings of sound events, is particularly challenged by the difficulty of acquiring a sufficient quantity of accurately labeled training data. Moreover, it is well known that manually annotated labels often contain noise and are significantly influenced by the subjective judgment of annotators. To address these challenges, we propose a novel pre-training method that utilizes a synthetic dataset, Formula-SED, where acoustic data are generated solely based on mathematical formulas. The proposed method enables large-scale pre-training by using the synthesis parameters applied at each time step as ground truth labels, thereby eliminating label noise and bias. We demonstrate that large-scale pre-training with Formula-SED significantly enhances model accuracy and accelerates training, as evidenced by our results on the DESED dataset used for DCASE2023 Challenge Task 4. The project page is at https://yutoshibata07.github.io/Formula-SED/
Submitted 6 April, 2025;
originally announced April 2025.
-
LGR: LLM-Guided Ranking of Frontiers for Object Goal Navigation
Authors:
Mitsuaki Uno,
Kanji Tanaka,
Daiki Iwata,
Yudai Noda,
Shoya Miyazaki,
Kouki Terashima
Abstract:
Object Goal Navigation (OGN) is a fundamental task for robots and AI, with key applications such as mobile robot image databases (MRID). In particular, mapless OGN is essential in scenarios involving unknown or dynamic environments. This study aims to enhance recent modular mapless OGN systems by leveraging the commonsense reasoning capabilities of large language models (LLMs). Specifically, we address the challenge of determining the visiting order in frontier-based exploration by framing it as a frontier ranking problem. Our approach is grounded in recent findings that, while LLMs cannot determine the absolute value of a frontier, they excel at evaluating the relative value between multiple frontiers viewed within a single image using the view image as context. We dynamically manage the frontier list by adding and removing elements, using an LLM as a ranking model. The ranking results are represented as reciprocal rank vectors, which are ideal for multi-view, multi-query information fusion. We validate the effectiveness of our method through evaluations in Habitat-Sim.
Submitted 26 March, 2025;
originally announced March 2025.
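The reciprocal rank vectors mentioned in the abstract above can be illustrated with a short sketch. It assumes each LLM query returns an ordered list of frontier identifiers and that vectors from several views or queries are fused by summation. The fusion rule, the identifiers, and the function names are assumptions, not details taken from the paper.

```python
def reciprocal_rank_vector(ranking, frontier_ids):
    """Score each frontier as 1/rank for one ranking; unranked frontiers get 0."""
    position = {frontier: rank for rank, frontier in enumerate(ranking, start=1)}
    return [1.0 / position[f] if f in position else 0.0 for f in frontier_ids]

def fuse_rankings(rankings, frontier_ids):
    """Sum reciprocal rank vectors over views/queries and sort frontiers by the total."""
    fused = [0.0] * len(frontier_ids)
    for ranking in rankings:
        vector = reciprocal_rank_vector(ranking, frontier_ids)
        fused = [total + value for total, value in zip(fused, vector)]
    totals = dict(zip(frontier_ids, fused))
    return sorted(frontier_ids, key=lambda f: -totals[f])

frontiers = ["f1", "f2", "f3"]
# Three hypothetical LLM rankings; the second view does not see f3 at all.
rankings = [["f1", "f2", "f3"], ["f2", "f1"], ["f2", "f3", "f1"]]
print(reciprocal_rank_vector(rankings[1], frontiers))  # [0.5, 1.0, 0.0]
print(fuse_rankings(rankings, frontiers))              # ['f2', 'f1', 'f3']
```

Because every ranking maps to a fixed-length vector over the current frontier list, rankings from views that see different subsets of frontiers can be combined without any absolute frontier value, which matches the abstract's motivation for relative ranking.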
-
Dynamic-Dark SLAM: RGB-Thermal Cooperative Robot Vision Strategy for Multi-Person Tracking in Both Well-Lit and Low-Light Scenes
Authors:
Tatsuro Sakai,
Kanji Tanaka,
Jonathan Tay Yu Liang,
Muhammad Adil Luqman,
Daiki Iwata
Abstract:
In robot vision, thermal cameras hold great potential for recognizing humans even in complete darkness. However, their application to multi-person tracking (MPT) has been limited due to data scarcity and the inherent difficulty of distinguishing individuals. In this study, we propose a cooperative MPT system that utilizes co-located RGB and thermal cameras, where pseudo-annotations (bounding boxes and person IDs) are used to train both RGB and thermal trackers. Evaluation experiments demonstrate that the thermal tracker performs robustly in both bright and dark environments. Moreover, the results suggest that a tracker-switching strategy -- guided by a binary brightness classifier -- is more effective for information integration than a tracker-fusion approach. As an application example, we present an image change pattern recognition (ICPR) method, the ``human-as-landmark,'' which combines two key properties: the thermal recognizability of humans in dark environments and the rich landmark characteristics -- appearance, geometry, and semantics -- of static objects (occluders). Whereas conventional SLAM focuses on mapping static landmarks in well-lit environments, the present study takes a first step toward a new Human-Only SLAM paradigm, ``DD-SLAM,'' which aims to map even dynamic landmarks in complete darkness.
Submitted 13 April, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
System 0/1/2/3: Quad-process theory for multi-timescale embodied collective cognitive systems
Authors:
Tadahiro Taniguchi,
Yasushi Hirai,
Masahiro Suzuki,
Shingo Murata,
Takato Horii,
Kazutoshi Tanaka
Abstract:
This paper introduces the System 0/1/2/3 framework as an extension of dual-process theory, employing a quad-process model of cognition. Expanding upon System 1 (fast, intuitive thinking) and System 2 (slow, deliberative thinking), we incorporate System 0, which represents pre-cognitive embodied processes, and System 3, which encompasses collective intelligence and symbol emergence. We contextualize this model within Bergson's philosophy by adopting multi-scale time theory to unify the diverse temporal dynamics of cognition. System 0 emphasizes morphological computation and passive dynamics, illustrating how physical embodiment enables adaptive behavior without explicit neural processing. Systems 1 and 2 are explained from a constructive perspective, incorporating neurodynamical and AI viewpoints. In System 3, we introduce collective predictive coding to explain how societal-level adaptation and symbol emergence operate over extended timescales. This comprehensive framework ranges from rapid embodied reactions to slow-evolving collective intelligence, offering a unified perspective on cognition across multiple timescales, levels of abstraction, and forms of human intelligence. The System 0/1/2/3 model provides a novel theoretical foundation for understanding the interplay between adaptive and cognitive processes, thereby opening new avenues for research in cognitive science, AI, robotics, and collective intelligence.
Submitted 13 March, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
Continual Multi-Robot Learning from Black-Box Visual Place Recognition Models
Authors:
Kenta Tsukahara,
Kanji Tanaka,
Daiki Iwata,
Jonathan Tay Yu Liang
Abstract:
In the context of visual place recognition (VPR), continual learning (CL) techniques offer significant potential for avoiding catastrophic forgetting when learning new places. However, existing CL methods often focus on knowledge transfer from a known model to a new one, overlooking the existence of unknown black-box models. We explore a novel multi-robot CL approach that enables knowledge transfer from black-box VPR models (teachers), such as those of local robots encountered by traveler robots (students) in unknown environments. Specifically, we introduce Membership Inference Attack, or MIA, the only major privacy attack applicable to black-box models, and leverage it to reconstruct pseudo training sets, which serve as the key knowledge to be exchanged between robots, from black-box VPR models. Furthermore, we aim to overcome the inherently low sampling efficiency of MIA by leveraging insights on place class prediction distribution and un-learned class detection imported from the VPR literature as a prior distribution. We also analyze both the individual effects of these methods and their combined impact. Experimental results demonstrate that our black-box MIA (BB-MIA) approach is remarkably powerful despite its simplicity, significantly enhancing the VPR capability of lower-performing robots through brief communication with other robots. This study contributes to optimizing knowledge sharing between robots in VPR and enhancing autonomy in open-world environments with multi-robot systems that are fault-tolerant and scalable.
Submitted 3 March, 2025;
originally announced March 2025.
-
Transtiff: A Stylus-shaped Interface for Rendering Perceived Stiffness of Virtual Objects via Stylus Stiffness Control
Authors:
Ryoya Komatsu,
Ayumu Ogura,
Shigeo Yoshida,
Kazutoshi Tanaka,
Yuichi Itoh
Abstract:
The replication of object stiffness is essential for enhancing haptic feedback in virtual environments. However, existing research has overlooked how stylus stiffness influences the perception of virtual object stiffness during tool-mediated interactions. To address this, we conducted a psychophysical experiment demonstrating that changing stylus stiffness combined with visual stimuli altered users' perception of virtual object stiffness. Based on these insights, we developed Transtiff, a stylus-shaped interface capable of on-demand stiffness control using a McKibben artificial muscle mechanism. Unlike previous approaches, our method manipulates the perceived stiffness of virtual objects via the stylus by controlling the stiffness of the stylus without altering the properties of the real object being touched, creating the illusion of a hard object feeling soft. Our user study confirmed that Transtiff effectively simulates a range of material properties, such as sponge, plastic, and tennis balls, providing haptic rendering that is closely aligned with the perceived material characteristics. By addressing the challenge of delivering realistic haptic feedback through tool-based interactions, Transtiff represents a significant advancement in haptic interface design for VR applications.
Submitted 13 February, 2025;
originally announced February 2025.
-
Relaxation-assisted reverse annealing on nonnegative/binary matrix factorization
Authors:
Renichiro Haba,
Masayuki Ohzeki,
Kazuyuki Tanaka
Abstract:
Quantum annealing has garnered significant attention as a metaheuristic inspired by quantum physics for combinatorial optimization problems. Among its many applications, nonnegative/binary matrix factorization stands out for its complexity and relevance in unsupervised machine learning. The use of reverse annealing, a derivative procedure of quantum annealing that prioritizes the search in the vicinity of a given initial state, helps improve its optimization performance in matrix factorization. This study proposes an improved strategy that integrates reverse annealing with a linear programming relaxation technique. Using relaxed solutions as the initial configuration for reverse annealing, we demonstrate improvements in optimization performance comparable to the exact optimization methods. Our experiments on facial image datasets show that our method provides better convergence than known reverse annealing methods. Furthermore, we investigate the effectiveness of relaxation-based initialization methods on randomized datasets, demonstrating a relationship between the relaxed solution and the optimal solution. This research underscores the potential of combining reverse annealing and classical optimization strategies to enhance optimization performance.
Submitted 3 January, 2025;
originally announced January 2025.
-
LMD-PGN: Cross-Modal Knowledge Distillation from First-Person-View Images to Third-Person-View BEV Maps for Universal Point Goal Navigation
Authors:
Riku Uemura,
Kanji Tanaka,
Kenta Tsukahara,
Daiki Iwata
Abstract:
Point goal navigation (PGN) is a mapless navigation approach that trains robots to visually navigate to goal points without relying on pre-built maps. Despite significant progress in handling complex environments using deep reinforcement learning, current PGN methods are designed for single-robot systems, limiting their generalizability to multi-robot scenarios with diverse platforms. This paper addresses this limitation by proposing a knowledge transfer framework for PGN, allowing a teacher robot to transfer its learned navigation model to student robots, including those with unknown or black-box platforms. We introduce a novel knowledge distillation (KD) framework that transfers first-person-view (FPV) representations (view images, turning/forward actions) to universally applicable third-person-view (TPV) representations (local maps, subgoals). The state is redefined as reconstructed local maps using SLAM, while actions are mapped to subgoals on a predefined grid. To enhance training efficiency, we propose a sampling-efficient KD approach that aligns training episodes via a noise-robust local map descriptor (LMD). Although validated on 2D wheeled robots, this method can be extended to 3D action spaces, such as drones. Experiments conducted in Habitat-Sim demonstrate the feasibility of the proposed framework, requiring minimal implementation effort. This study highlights the potential for scalable and cross-platform PGN solutions, expanding the applicability of embodied AI systems in multi-robot scenarios.
Submitted 23 December, 2024;
originally announced December 2024.
-
ON as ALC: Active Loop Closing Object Goal Navigation
Authors:
Daiki Iwata,
Kanji Tanaka,
Shoya Miyazaki,
Kouki Terashima
Abstract:
In simultaneous localization and mapping, active loop closing (ALC) is an active vision problem that aims to visually guide a robot to maximize the chances of revisiting previously visited points, thereby resetting the drift errors accumulated in the incrementally built map during travel. However, current mainstream navigation strategies that leverage such incomplete maps as workspace prior knowledge often fail in modern long-term, long-distance autonomous travel scenarios, where map accumulation errors become significant. To address these limitations of map-based navigation, this paper is the first to explore mapless navigation in the embodied AI field, in particular, to utilize object-goal navigation (commonly abbreviated as ON, ObjNav, or OGN) techniques that efficiently explore target objects without using such a prior map. Specifically, in this work, we start from an off-the-shelf mapless ON planner, extend it to utilize a prior map, and further show that the performance in long-distance ALC (LD-ALC) can be maximized by minimizing "ALC loss" and "ON loss". This study highlights a simple and effective approach, called ALC-ON (ALCON), to accelerate the progress of challenging long-distance ALC technology by leveraging the growing frontier-guided, data-driven, and LLM-guided ON technologies.
Submitted 14 May, 2025; v1 submitted 16 December, 2024;
originally announced December 2024.
-
Optimizing Vision-Language Interactions Through Decoder-Only Models
Authors:
Kaito Tanaka,
Benjamin Tan,
Brian Wong
Abstract:
Vision-Language Models (VLMs) have emerged as key enablers for multimodal tasks, but their reliance on separate visual encoders introduces challenges in efficiency, scalability, and modality alignment. To address these limitations, we propose MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model that seamlessly integrates visual and textual inputs through a novel Vision-Token Adapter (VTA) and adaptive co-attention mechanism. By eliminating the need for a visual encoder, MUDAIF achieves enhanced efficiency, flexibility, and cross-modal understanding. Trained on a large-scale dataset of 45M image-text pairs, MUDAIF consistently outperforms state-of-the-art methods across multiple benchmarks, including VQA, image captioning, and multimodal reasoning tasks. Extensive analyses and human evaluations demonstrate MUDAIF's robustness, generalization capabilities, and practical usability, establishing it as a new standard in encoder-free vision-language models.
Submitted 14 December, 2024;
originally announced December 2024.
-
SyncViolinist: Music-Oriented Violin Motion Generation Based on Bowing and Fingering
Authors:
Hiroki Nishizawa,
Keitaro Tanaka,
Asuka Hirata,
Shugo Yamaguchi,
Qi Feng,
Masatoshi Hamanaka,
Shigeo Morishima
Abstract:
Automatically generating realistic musical performance motion can greatly enhance digital media production, often involving collaboration between professionals and musicians. However, capturing the intricate body, hand, and finger movements required for accurate musical performances is challenging. Existing methods often fall short due to the complex mapping between audio and motion, typically requiring additional inputs like scores or MIDI data. In this work, we present SyncViolinist, a multi-stage end-to-end framework that generates synchronized violin performance motion solely from audio input. Our method overcomes the challenge of capturing both global and fine-grained performance features through two key modules: a bowing/fingering module and a motion generation module. The bowing/fingering module extracts detailed playing information from the audio, which the motion generation module uses to create precise, coordinated body motions reflecting the temporal granularity and nature of the violin performance. We demonstrate the effectiveness of SyncViolinist with significantly improved qualitative and quantitative results from unseen violin performance audio, outperforming state-of-the-art methods. Extensive subjective evaluations involving professional violinists further validate our approach. The code and dataset are available at https://github.com/Kakanat/SyncViolinist.
Submitted 11 December, 2024;
originally announced December 2024.
-
Ordinal Multiple-instance Learning for Ulcerative Colitis Severity Estimation with Selective Aggregated Transformer
Authors:
Kaito Shiku,
Kazuya Nishimura,
Daiki Suehiro,
Kiyohito Tanaka,
Ryoma Bise
Abstract:
Patient-level diagnosis of severity in ulcerative colitis (UC) is common in real clinical settings, where the most severe score in a patient is recorded. However, previous UC classification methods (i.e., image-level estimation) mainly assumed the input was a single image. Thus, these methods cannot utilize severity labels recorded in real clinical settings. In this paper, we propose a patient-level severity estimation method using a transformer with selective aggregator tokens, where a severity label is estimated from multiple images taken from a patient, similar to a clinical setting. Our method can effectively aggregate features of severe parts from a set of images captured in each patient, and it improves the discriminative ability between adjacent severity classes. Experiments demonstrate the effectiveness of the proposed method on two datasets compared with the state-of-the-art MIL methods. Moreover, we evaluated our method in real clinical settings and confirmed that our method outperformed the previous image-level methods. The code is publicly available at https://github.com/Shiku-Kaito/Ordinal-Multiple-instance-Learning-for-Ulcerative-Colitis-Severity-Estimation.
Submitted 22 November, 2024;
originally announced November 2024.
-
Long-term Detection System for Six Kinds of Abnormal Behavior of the Elderly Living Alone
Authors:
Kai Tanaka,
Mineichi Kudo,
Keigo Kimura,
Atsuyoshi Nakamura
Abstract:
The proportion of elderly people is increasing worldwide, particularly those living alone in Japan. As elderly people get older, their risks of physical disabilities and health issues increase. To automatically discover these issues at a low cost in daily life, sensor-based detection in a smart home is promising. As part of the effort towards early detection of abnormal behaviors, we propose a simulator-based detection system for six typical anomalies: being semi-bedridden, being housebound, forgetting, wandering, falling while walking, and falling while standing. Our detection system can be customized for various room layouts, sensor arrangements, and resident characteristics by training detection classifiers using the simulator with the parameters fitted to individual cases. Considering that the six anomalies that our system detects have various occurrence durations, such as being housebound for weeks or lying still for seconds after a fall, the detection classifiers of our system produce anomaly labels depending on each anomaly's occurrence duration, e.g., housebound per day and falls per second. We propose a method that standardizes the processing of sensor data and uses a simple detection approach. Although the validity depends on the realism of the simulation, numerical evaluations using sensor data that includes a variety of resident behavior patterns over nine years as test data show that (1) the methods for detecting wandering and falls are comparable to previous methods, and (2) the methods for detecting being semi-bedridden, being housebound, and forgetting achieve a sensitivity of over 0.9 with fewer than one false alarm every 50 days.
Submitted 20 November, 2024;
originally announced November 2024.
-
Self-Relaxed Joint Training: Sample Selection for Severity Estimation with Ordinal Noisy Labels
Authors:
Shumpei Takezaki,
Kiyohito Tanaka,
Seiichi Uchida
Abstract:
Severity level estimation is a crucial task in medical image diagnosis. However, accurately assigning severity class labels to individual images is very costly and challenging. Consequently, the attached labels tend to be noisy. In this paper, we propose a new framework for training with "ordinal" noisy labels. Since severity levels have an ordinal relationship, we can leverage this to train a classifier while mitigating the negative effects of noisy labels. Our framework uses two techniques: clean sample selection and a dual-network architecture. A technical highlight of our approach is the use of soft labels derived from noisy hard labels. By appropriately using the soft and hard labels in the two techniques, we achieve more accurate sample selection and robust network training. The proposed method outperforms various state-of-the-art methods in experiments using two endoscopic ulcerative colitis (UC) datasets and a retinal Diabetic Retinopathy (DR) dataset. Our code is available at https://github.com/shumpei-takezaki/Self-Relaxed-Joint-Training.
Submitted 29 October, 2024;
originally announced October 2024.
-
CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization
Authors:
Shigemichi Matsuzaki,
Kazuhito Tanaka,
Kazuhiro Shintani
Abstract:
This letter proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and sensitive to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and observation completeness to improve the robustness. We confirmed improvements in matching and pose estimation accuracy through experiments on ScanNet and TUM datasets.
Submitted 3 October, 2024;
originally announced October 2024.
-
REST-HANDS: Rehabilitation with Egocentric Vision Using Smartglasses for Treatment of Hands after Surviving Stroke
Authors:
Wiktor Mucha,
Kentaro Tanaka,
Martin Kampel
Abstract:
Stroke is the third leading cause of death and disability worldwide and is recognised as a significant global health problem. A major challenge for stroke survivors is persistent hand dysfunction, which severely affects the ability to perform daily activities and the overall quality of life. In order to regain their functional hand ability, stroke survivors need rehabilitation therapy. However, traditional rehabilitation requires continuous medical support, creating dependency on an overburdened healthcare system. In this paper, we explore the use of egocentric recordings from commercially available smart glasses, specifically RayBan Stories, for remote hand rehabilitation. Our approach includes offline experiments to evaluate the potential of smart glasses for automatic exercise recognition, exercise form evaluation, and repetition counting. We present REST-HANDS, the first dataset of egocentric hand exercise videos. Using state-of-the-art methods, we establish benchmarks with high accuracy rates for exercise recognition (98.55%), form evaluation (86.98%), and repetition counting (mean absolute error of 1.33). Our study demonstrates the feasibility of using egocentric video from smart glasses for remote rehabilitation, paving the way for further research.
Submitted 30 September, 2024;
originally announced September 2024.
-
CON: Continual Object Navigation via Data-Free Inter-Agent Knowledge Transfer in Unseen and Unfamiliar Places
Authors:
Kouki Terashima,
Daiki Iwata,
Kanji Tanaka
Abstract:
This work explores the potential of brief inter-agent knowledge transfer (KT) to enhance the robotic object goal navigation (ON) in unseen and unfamiliar environments. Drawing on the analogy of human travelers acquiring local knowledge, we propose a framework in which a traveler robot (student) communicates with local robots (teachers) to obtain ON knowledge through minimal interactions. We frame this process as a data-free continual learning (CL) challenge, aiming to transfer knowledge from a black-box model (teacher) to a new model (student). In contrast to approaches like zero-shot ON using large language models (LLMs), which utilize inherently communication-friendly natural language for knowledge representation, the other two major ON approaches -- frontier-driven methods using object feature maps and learning-based ON using neural state-action maps -- present complex challenges where data-free KT remains largely uncharted. To address this gap, we propose a lightweight, plug-and-play KT module targeting non-cooperative black-box teachers in open-world settings. Using the universal assumption that every teacher robot has vision and mobility capabilities, we define state-action history as the primary knowledge base. Our formulation leads to the development of a query-based occupancy map that dynamically represents target object locations, serving as an effective and communication-friendly knowledge representation. We validate the effectiveness of our method through experiments conducted in the Habitat environment.
Submitted 23 September, 2024;
originally announced September 2024.
-
Visuo-Tactile Zero-Shot Object Recognition with Vision-Language Model
Authors:
Shiori Ueda,
Atsushi Hashimoto,
Masashi Hamaya,
Kazutoshi Tanaka,
Hideo Saito
Abstract:
Tactile perception is vital, especially when distinguishing visually similar objects. We propose an approach to incorporate tactile data into a Vision-Language Model (VLM) for visuo-tactile zero-shot object recognition. Our approach leverages the zero-shot capability of VLMs to infer tactile properties from the names of tactilely similar objects. The proposed method translates tactile data into a textual description solely by annotating object names for each tactile sequence during training, making it adaptable to various contexts with low training costs. The proposed method was evaluated on the FoodReplica and Cube datasets, demonstrating its effectiveness in recognizing objects that are difficult to distinguish by vision alone.
Submitted 13 September, 2024;
originally announced September 2024.
-
Deep Bayesian Active Learning-to-Rank with Relative Annotation for Estimation of Ulcerative Colitis Severity
Authors:
Takeaki Kadota,
Hideaki Hayashi,
Ryoma Bise,
Kiyohito Tanaka,
Seiichi Uchida
Abstract:
Automatic image-based severity estimation is an important task in computer-aided diagnosis. Severity estimation by deep learning requires a large amount of training data to achieve high performance. In general, severity estimation uses training data annotated with discrete (i.e., quantized) severity labels. Annotating discrete labels is often difficult in images with ambiguous severity, and the annotation cost is high. In contrast, relative annotation, in which the severity between a pair of images is compared, can avoid quantizing severity and thus makes annotation easier. We can estimate relative disease severity using a learning-to-rank framework with relative annotations, but relative annotation suffers from the enormous number of pairs that can be annotated. Therefore, the selection of appropriate pairs is essential for relative annotation. In this paper, we propose a deep Bayesian active learning-to-rank method that automatically selects appropriate pairs for relative annotation. Our method preferentially annotates unlabeled pairs with high learning efficiency, based on the model uncertainty of the samples. We prove the theoretical basis for adapting Bayesian neural networks to pairwise learning-to-rank and demonstrate the efficiency of our method through experiments on endoscopic images of ulcerative colitis on both private and public datasets. We also show that our method achieves high performance under conditions of significant class imbalance because it automatically selects samples from the minority classes.
Submitted 9 September, 2024; v1 submitted 7 September, 2024;
originally announced September 2024.
-
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation
Authors:
Takuhiro Kaneko,
Hirokazu Kameoka,
Kou Tanaka,
Yuto Kondo
Abstract:
Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of the multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), leveraging the ability of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior to or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/.
Submitted 3 September, 2024;
originally announced September 2024.
-
Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
Authors:
Tatsuhiro Shimizu,
Koichi Tanaka,
Ren Kishimoto,
Haruka Kiyohara,
Masahiro Nomura,
Yuta Saito
Abstract:
We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset in the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L of CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem; however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce the concept of a factored action space, which allows us to decompose each subset into binary indicators. This formulation allows us to distinguish between the "main effect" derived from the main actions and the "residual effect" originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as illustrated in our theoretical analysis. Experiments demonstrate OPCB's superior performance over typical methods in both OPE and OPL.
Submitted 20 August, 2024;
originally announced August 2024.
-
Leveraging Language Models for Emotion and Behavior Analysis in Education
Authors:
Kaito Tanaka,
Benjamin Tan,
Brian Wong
Abstract:
The analysis of students' emotions and behaviors is crucial for enhancing learning outcomes and personalizing educational experiences. Traditional methods often rely on intrusive visual and physiological data collection, posing privacy concerns and scalability issues. This paper proposes a novel method leveraging large language models (LLMs) and prompt engineering to analyze textual data from students. Our approach utilizes tailored prompts to guide LLMs in detecting emotional and engagement states, providing a non-intrusive and scalable solution. We conducted experiments using Qwen, ChatGPT, Claude2, and GPT-4, comparing our method against baseline models and chain-of-thought (CoT) prompting. Results demonstrate that our method significantly outperforms the baselines in both accuracy and contextual understanding. This study highlights the potential of LLMs combined with prompt engineering to offer practical and effective tools for educational emotion and behavior analysis.
Submitted 13 August, 2024;
originally announced August 2024.
-
Are Social Sentiments Inherent in LLMs? An Empirical Study on Extraction of Inter-demographic Sentiments
Authors:
Kunitomo Tanaka,
Ryohei Sasano,
Koichi Takeda
Abstract:
Large language models (LLMs) are supposed to acquire unconscious human knowledge and feelings, such as social common sense and biases, by training on large amounts of text. However, it is not clear how much the sentiments of specific social groups can be captured in various LLMs. In this study, we focus on social groups defined in terms of nationality, religion, and race/ethnicity, and validate the extent to which sentiments between social groups can be captured in and extracted from LLMs. Specifically, we input questions regarding sentiments from one group to another into LLMs, apply sentiment analysis to the responses, and compare the results with social surveys. The validation results using five representative LLMs showed higher correlations with relatively small p-values for nationalities and religions, for which the number of data points was relatively large. This result indicates that the LLM responses including the inter-group sentiments align well with actual social survey results.
Submitted 8 August, 2024;
originally announced August 2024.
-
Token-based Decision Criteria Are Suboptimal in In-context Learning
Authors:
Hakaze Cho,
Yoshihiro Sakai,
Mariko Kato,
Kenshiro Tanaka,
Akira Ishii,
Naoya Inoue
Abstract:
In-Context Learning (ICL) typically utilizes classification criteria from output probabilities of manually selected label tokens. However, we argue that such token-based classification criteria lead to suboptimal decision boundaries, even after delicate calibration through translation and constrained rotation. To address this problem, we propose Hidden Calibration, which renounces token probabilities and uses the nearest centroid classifier on the LM's last hidden states. In detail, we assign to each test sample the label of the nearest centroid, where the centroids are estimated beforehand from a calibration set. Our experiments on 6 models and 10 classification datasets indicate that Hidden Calibration consistently outperforms current token-based baselines by about 20% to 50%, achieving a strong state-of-the-art in ICL. Our further analysis demonstrates that Hidden Calibration finds better classification criteria with less inter-class overlap, and LMs provide linearly separable intra-class clusters with the help of demonstrations, which supports Hidden Calibration and gives new insights into the principle of ICL. Our official code implementation can be found at https://github.com/hc495/Hidden_Calibration.
Submitted 5 February, 2025; v1 submitted 24 June, 2024;
originally announced June 2024.
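The nearest-centroid step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fit_centroids` and `predict`, the toy 2-D vectors standing in for last hidden states, and the choice of Euclidean distance are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Average the calibration vectors of each label into one centroid."""
    groups = defaultdict(list)
    for vec, label in zip(features, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest to vec (Euclidean; assumed)."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


# Hypothetical calibration set: toy 2-D vectors in place of real hidden states.
calib_x = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
calib_y = ["neg", "neg", "pos", "pos"]
centroids = fit_centroids(calib_x, calib_y)
```

With this toy data, `predict(centroids, [1.0, 1.0])` selects the centroid of the "neg" class. In a real setting the vectors would be hidden states extracted from the language model, which this sketch does not cover.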
-
DRIP: Discriminative Rotation-Invariant Pole Landmark Descriptor for 3D LiDAR Localization
Authors:
Dingrui Li,
Dedi Guo,
Kanji Tanaka
Abstract:
In 3D LiDAR-based robot self-localization, pole-like landmarks are gaining popularity as lightweight and discriminative landmarks. This work introduces a novel approach called "discriminative rotation-invariant poles," which enhances the discriminability of pole-like landmarks while maintaining their lightweight nature. Unlike conventional methods that model a pole landmark as a 3D line segment perpendicular to the ground, we propose a simple yet powerful approach that includes not only the line segment's main body but also its surrounding local region of interest (ROI) as part of the pole landmark. Specifically, we describe the appearance, geometry, and semantic features within this ROI to improve the discriminability of the pole landmark. Since such pole landmarks are no longer rotation-invariant, we introduce a novel rotation-invariant convolutional neural network that automatically and efficiently extracts rotation-invariant features from input point clouds for recognition. Furthermore, we train a pole dictionary through unsupervised learning and use it to compress poles into compact pole words, thereby significantly reducing real-time costs while maintaining optimal self-localization performance. Monte Carlo localization experiments using the publicly available NCLT dataset demonstrate that the proposed method improves a state-of-the-art pole-based localization framework.
Submitted 17 June, 2024;
originally announced June 2024.
-
Understanding Token Probability Encoding in Output Embeddings
Authors:
Hakaze Cho,
Yoshihiro Sakai,
Kenshiro Tanaka,
Mariko Kato,
Naoya Inoue
Abstract:
In this paper, we investigate the output token probability information in the output embedding of language models. We find an approximate common log-linear encoding of output token probabilities within the output embedding vectors and empirically demonstrate that it is accurate and sparse. As a causality examination, we steer the encoding in output embedding to modify the output probability distribution accurately. Moreover, the sparsity we find in output probability encoding suggests that a large number of dimensions in the output embedding do not contribute to causal language modeling. Therefore, we attempt to delete the output-unrelated dimensions and find more than 30% of the dimensions can be deleted without significant movement in output distribution and sequence generation. Additionally, in the pre-training dynamics of language models, we find that the output embeddings capture the corpus token frequency information in early steps, even before an obvious convergence of parameters starts.
Submitted 11 December, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Zero-shot Degree of Ill-posedness Estimation for Active Small Object Change Detection
Authors:
Koji Takeda,
Kanji Tanaka,
Yoshimasa Nakamura,
Asako Kanezaki
Abstract:
In everyday indoor navigation, robots often need to detect non-distinctive small-change objects (e.g., stationery, lost items, and junk) to maintain domain knowledge. This is most relevant to ground-view change detection (GVCD), a recently emerging research area in the field of computer vision. However, these existing techniques rely on high-quality class-specific object priors to regularize a change detector model that cannot be applied to semantically non-distinctive small objects. To address ill-posedness, in this study, we explore the concept of degree-of-ill-posedness (DoI) from the new perspective of GVCD, aiming to improve both passive and active vision. This novel DoI problem is highly domain-dependent, and manually collecting fine-grained annotated training data is expensive. To regularize this problem, we apply the concept of self-supervised learning to achieve an efficient DoI estimation scheme and investigate its generalization to diverse datasets. Specifically, we tackle the challenging issue of obtaining self-supervision cues for semantically non-distinctive unseen small objects and show that novel "oversegmentation cues" from open vocabulary semantic segmentation can be effectively exploited. When applied to diverse real datasets, the proposed DoI model can boost state-of-the-art change detection models, and it shows stable and consistent improvements when evaluated on real-world datasets.
Submitted 9 May, 2024;
originally announced May 2024.
-
Deep Learning for Video-Based Assessment of Endotracheal Intubation Skills
Authors:
Jean-Paul Ainam,
Erim Yanik,
Rahul Rahul,
Taylor Kunkes,
Lora Cavuoto,
Brian Clemency,
Kaori Tanaka,
Matthew Hackett,
Jack Norfleet,
Suvranu De
Abstract:
Endotracheal intubation (ETI) is an emergency procedure performed in civilian and combat casualty care settings to establish an airway. Objective and automated assessment of ETI skills is essential for the training and certification of healthcare providers. However, the current approach is based on manual feedback by an expert, which is subjective, time- and resource-intensive, and prone to poor inter-rater reliability and halo effects. This work proposes a framework to evaluate ETI skills using single and multi-view videos. The framework consists of two stages. First, a 2D convolutional autoencoder (AE) and a pre-trained self-supervision network extract features from videos. Second, a 1D convolutional network enhanced with a cross-view attention module takes the features from the AE as input and outputs predictions for skill evaluation. The ETI datasets were collected in two phases. In the first phase, ETI is performed by two subject cohorts: Experts and Novices. In the second phase, novice subjects perform ETI under time pressure, and the outcome is either Successful or Unsuccessful. A third dataset of videos from a single head-mounted camera for Experts and Novices is also analyzed. The study achieved an accuracy of 100% in identifying Expert/Novice trials in the initial phase. In the second phase, the model showed 85% accuracy in classifying Successful/Unsuccessful procedures. Using head-mounted cameras alone, the model showed 96% accuracy on Expert/Novice classification while maintaining an accuracy of 85% on classifying Successful/Unsuccessful procedures. In addition, GradCAMs are presented to explain the differences between Expert and Novice behavior and between Successful and Unsuccessful trials. The approach offers a reliable and objective method for automated assessment of ETI skills.
Submitted 17 April, 2024;
originally announced April 2024.
-
1-out-of-n Oblivious Signatures: Security Revisited and a Generic Construction with an Efficient Communication Cost
Authors:
Masayuki Tezuka,
Keisuke Tanaka
Abstract:
1-out-of-n oblivious signature by Chen (ESORICS 1994) is a protocol between the user and the signer. In this scheme, the user makes a list of n messages and chooses from the list the message for which the user wants to obtain a signature. The user interacts with the signer by providing this message list and obtains the signature for only the chosen message without letting the signer identify which message the user chose. Tso et al. (ISPEC 2008) presented a formal treatment of 1-out-of-n oblivious signatures. They defined unforgeability and ambiguity for 1-out-of-n oblivious signatures as security requirements. In this work, first, we revisit the unforgeability security definition by Tso et al. and point out that their security definition has problems. We address these problems by modifying their security model and redefining unforgeable security. Second, we improve the generic construction of a 1-out-of-n oblivious signature scheme by Zhou et al. (IEICE Trans 2022). We reduce the communication cost by modifying their scheme with a Merkle tree. Then we prove the security of our modified scheme.
Submitted 31 March, 2024;
originally announced April 2024.
-
Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator
Authors:
Takuhiro Kaneko,
Hirokazu Kameoka,
Kou Tanaka
Abstract:
A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Thus, augmented speech (which can be extraordinary) may be considered real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to the augmentation state, without inhibiting the learning of the original non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited data conditions while achieving comparable speech quality under sufficient data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.
Submitted 25 March, 2024;
originally announced March 2024.
-
ProgrammableGrass: A Shape-Changing Artificial Grass Display Adapted for Dynamic and Interactive Display Features
Authors:
Kojiro Tanaka,
Akito Mizuno,
Toranosuke Kato,
Masahiko Mikawa,
Makoto Fujisawa
Abstract:
There are various proposals for employing grass materials as a green landscape-friendly display. However, it is difficult for current techniques to display smooth animations using 8-bit images and to adjust display resolution, as conventional displays can. We present ProgrammableGrass, an artificial grass display with scalable resolution, capable of swiftly controlling grass color at 8-bit levels. This grass display can control grass colors linearly at the 8-bit level, similar to an LCD, and can display not only 8-bit-based images but also videos. This display enables pixel-by-pixel color transitions from yellow to green using fixed-length yellow and adjustable-length green grass. We designed a grass module that can be connected to other modules. Utilizing proportional-derivative control, the grass colors are manipulated to display animations at approximately 10 fps. Since the relationship between grass lengths and colors is nonlinear, we developed a calibration system for ProgrammableGrass. Experiments under multiple conditions showed that this calibration system allows ProgrammableGrass to linearly control grass colors at 8-bit levels. Lastly, we demonstrate that ProgrammableGrass can show smooth animations with 8-bit grayscale images. Moreover, we show several application examples to illustrate the potential of ProgrammableGrass. With the advancement of this technology, users will be able to treat grass as a green-based interactive display device.
Submitted 18 March, 2024;
originally announced March 2024.
-
Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge Transfer
Authors:
Kenta Tsukahara,
Kanji Tanaka,
Daiki Iwata
Abstract:
A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.
Submitted 12 March, 2024;
originally announced March 2024.
-
Swarm Body: Embodied Swarm Robots
Authors:
Sosuke Ichihashi,
So Kuroki,
Mai Nishimura,
Kazumi Kasaura,
Takefumi Hiraki,
Kazutoshi Tanaka,
Shigeo Yoshida
Abstract:
The human brain's plasticity allows for the integration of artificial body parts into the human body. Leveraging this, embodied systems realize intuitive interactions with the environment. We introduce a novel concept: embodied swarm robots. Swarm robots constitute a collective of robots working in harmony to achieve a common objective, in our case, serving as functional body parts. Embodied swarm robots can dynamically alter their shape, density, and the correspondences between body parts and individual robots. We contribute an investigation of the influence on embodiment of swarm robot-specific factors derived from these characteristics, focusing on a hand. Our paper is the first to examine these factors through virtual reality (VR) and real-world robot studies to provide essential design considerations and applications of embodied swarm robots. Through quantitative and qualitative analysis, we identified a system configuration to achieve the embodiment of swarm robots.
Submitted 29 February, 2024; v1 submitted 24 February, 2024;
originally announced February 2024.
-
CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps
Authors:
Shigemichi Matsuzaki,
Takuma Sugino,
Kazuhito Tanaka,
Zijun Sha,
Shintaro Nakaoka,
Shintaro Yoshizawa,
Kazuhiro Shintani
Abstract:
This paper describes a multi-modal data association method for global localization using object-based maps and camera images. In global localization, or relocalization, using object-based maps, existing methods typically resort to matching all possible combinations of detected objects and landmarks with the same object category, followed by inlier extraction using RANSAC or brute-force search. This approach becomes infeasible as the number of landmarks increases due to the exponential growth of correspondence candidates. In this paper, we propose labeling landmarks with natural language descriptions and extracting correspondences based on conceptual similarity with image observations using a Vision Language Model (VLM). By leveraging detailed text information, our approach efficiently extracts correspondences compared to methods using only object categories. Through experiments, we demonstrate that the proposed method enables more accurate global localization with fewer iterations compared to baseline methods, exhibiting its efficiency.
Submitted 8 February, 2024;
originally announced February 2024.
-
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Authors:
Kohei Uehara,
Nabarun Goswami,
Hanqin Wang,
Toshiaki Baba,
Kohtaro Tanaka,
Tomohiro Hashimoto,
Kai Wang,
Rei Ito,
Takagi Naoya,
Ryo Umagami,
Yingyi Wen,
Tanachai Anakewat,
Tatsuya Harada
Abstract:
The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to develop a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we created, we fine-tuned an existing VLM. This training enabled the models to generate questions and perform iterative reasoning during inference. The results demonstrated a stride toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.
Submitted 17 July, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
Data assimilation approach for addressing imperfections in people flow measurement techniques using particle filter
Authors:
Ryo Murata,
Kenji Tanaka
Abstract:
Understanding and predicting people flow in urban areas is useful for decision-making in urban planning and marketing strategies. Traditional methods for understanding people flow can be divided into measurement-based approaches and simulation-based approaches. Measurement-based approaches have the advantage of directly capturing actual people flow, but they face the challenge of data imperfection. On the other hand, simulations can obtain complete data on a computer, but they only consider some of the factors determining human behavior, leading to a divergence from actual people flow. Both measurement and simulation methods have unresolved issues, and combining the two can complementarily overcome them. This paper proposes a method that applies data assimilation, a fusion technique of measurement and simulation, to agent-based simulation. Data assimilation combines the advantages of both measurement and simulation, contributing to the creation of an environment that can reflect real people flow while acquiring richer data. The paper verifies the effectiveness of the proposed method in a virtual environment and demonstrates the potential of data assimilation to compensate for the three types of imperfection in people flow measurement techniques. These findings can serve as guidelines for supplementing sparse measurement data in physical environments.
Submitted 17 January, 2024;
originally announced January 2024.
-
Polygonal Sequence-driven Triangulation Validator: An Incremental Approach to 2D Triangulation Verification
Authors:
Sora Sawai,
Kazuaki Tanaka,
Katsuhisa Ozaki,
Shin'ichi Oishi
Abstract:
Two-dimensional Delaunay triangulation is a fundamental aspect of computational geometry. This paper presents a novel algorithm that is specifically designed to ensure the correctness of 2D Delaunay triangulation, namely the Polygonal Sequence-driven Triangulation Validator (PSTV). Our research highlights the paramount importance of proper triangulation and the often overlooked, yet profound, impact of rounding errors in numerical computations on the precision of triangulation. The primary objective of the PSTV algorithm is to identify these computational errors and ensure the accuracy of the triangulation output. In addition to validating the correctness of triangulation, this study underscores the significance of the Delaunay property for the quality of finite element methods. Effective strategies are proposed to verify this property for a triangulation and correct it when necessary. While acknowledging the difficulty of rectifying complex triangulation errors such as overlapping triangles, these strategies provide valuable insights on identifying the locations of these errors and remedying them. The unique feature of the PSTV algorithm lies in its adoption of floating-point filters in place of interval arithmetic, striking an effective balance between computational efficiency and precision. This research sets a vital precedent for error reduction and precision enhancement in computational geometry.
Submitted 16 January, 2024;
originally announced January 2024.
-
Sensor Data Simulation for Anomaly Detection of the Elderly Living Alone
Authors:
Kai Tanaka,
Mineichi Kudo,
Keigo Kimura
Abstract:
With the increasing number of elderly people living alone around the world, there is a growing demand for sensor-based detection of anomalous behaviors. Although smart homes with ambient sensors could be useful for detecting such anomalies, there is a lack of sufficient real data for developing detection algorithms. To cope with this problem, several sensor data simulators have been proposed, but they have not been able to appropriately model the long-term transitions and correlations between anomalies that exist in reality. In this paper, therefore, we propose a novel sensor data simulator that can model these factors when generating sensor data. Anomalies considered in this study were classified into three types: state anomalies, activity anomalies, and moving anomalies. The simulator produces 10 years of data in 100 minutes, including six anomalies, two of each type. Numerical evaluations show that this simulator is superior to past simulators in the sense that it simulates the day-to-day variations of real data well.
Submitted 28 December, 2023;
originally announced December 2023.
-
Recursive Distillation for Open-Set Distributed Robot Localization
Authors:
Kenta Tsukahara,
Kanji Tanaka
Abstract:
A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available for the target workspace. However, this is not necessarily true when a robot travels around the general open world. This work introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and then used for continual learning of the student model under domain, class, and vocabulary incremental setup. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, so that it can handle various types of open-set teachers, including those uncooperative, untrainable (e.g., image retrieval engines), or black-box teachers (i.e., data privacy). In this paper, we investigate a ranking function as an instance of such generic models, using a challenging data-free recursive distillation scenario, where a student once trained can recursively join the next-generation open teacher set.
Submitted 26 September, 2024; v1 submitted 26 December, 2023;
originally announced December 2023.
-
Vision-Language Interpreter for Robot Task Planning
Authors:
Keisuke Shirai,
Cristian C. Beltran-Hernandez,
Masashi Hamaya,
Atsushi Hashimoto,
Shohei Tanaka,
Kento Kawaharazuka,
Kazutoshi Tanaka,
Yoshitaka Ushiku,
Shinsuke Mori
Abstract:
Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from language instruction and scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLM and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy. Our code and dataset are available at https://github.com/omron-sinicx/ViLaIn.
Submitted 19 February, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Cross-view Self-localization from Synthesized Scene-graphs
Authors:
Ryogo Yamamoto,
Kanji Tanaka
Abstract:
Cross-view self-localization is a challenging scenario of visual place recognition in which database images are provided from sparse viewpoints. Recently, an approach for synthesizing database images from unseen viewpoints using NeRF (Neural Radiance Fields) technology has emerged with impressive performance. However, synthesized images provided by these techniques are often of lower quality than the original images, and furthermore they significantly increase the storage cost of the database. In this study, we explore a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images. These two types of features are then fused into scene graphs, and compressively learned and recognized by a graph neural network. The effectiveness of the proposed method was verified using a novel cross-view self-localization dataset with many unseen views generated using a photorealistic Habitat simulator.
Submitted 24 October, 2023;
originally announced October 2023.
-
Multimodal Active Measurement for Human Mesh Recovery in Close Proximity
Authors:
Takahiro Maeda,
Keisuke Takeshita,
Norimichi Ukita,
Kazuhito Tanaka
Abstract:
For physical human-robot interactions (pHRI), a robot needs to estimate the accurate body pose of a target person. However, in these pHRI scenarios, the robot cannot fully observe the target person's body with equipped cameras because the target person must be close to the robot for physical interaction. This close distance leads to severe truncation and occlusions and thus results in poor accuracy of human pose estimation. For better accuracy in this challenging environment, we propose an active measurement and sensor fusion framework of the equipped cameras with touch and ranging sensors such as 2D LiDAR. Touch and ranging sensor measurements are sparse but reliable and informative cues for localizing human body parts. In our active measurement process, camera viewpoints and sensor placements are dynamically optimized to measure body parts with higher estimation uncertainty, which is closely related to truncation or occlusion. In our sensor fusion process, assuming that the measurements of touch and ranging sensors are more reliable than the camera-based estimations, we fuse the sensor measurements to the camera-based estimated pose by aligning the estimated pose towards the measured points. Our proposed method outperformed previous methods on the standard occlusion benchmark with simulated active measurement. Furthermore, our method reliably estimated human poses using a real robot, even with practical constraints such as occlusion by blankets.
Submitted 8 October, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.