-
Shape Reconstruction from Thoracoscopic Images using Self-supervised Virtual Learning
Authors:
Tomoki Oya,
Megumi Nakao,
Tetsuya Matsuda
Abstract:
Intraoperative shape reconstruction of organs from endoscopic camera images is a complex yet indispensable technique for image-guided surgery. To address the uncertainty in reconstructing entire shapes from single-viewpoint occluded images, we propose a framework for generative virtual learning of shape reconstruction using image translation with common latent variables between simulated and real…
▽ More
Intraoperative shape reconstruction of organs from endoscopic camera images is a complex yet indispensable technique for image-guided surgery. To address the uncertainty in reconstructing entire shapes from single-viewpoint occluded images, we propose a framework for generative virtual learning of shape reconstruction using image translation with common latent variables between simulated and real images. As it is difficult to prepare sufficient amount of data to learn the relationship between endoscopic images and organ shapes, self-supervised virtual learning is performed using simulated images generated from statistical shape models. However, small differences between virtual and real images can degrade the estimation performance even if the simulated images are regarded as equivalent by humans. To address this issue, a Variational Autoencoder is used to convert real and simulated images into identical synthetic images. In this study, we targeted the shape reconstruction of collapsed lungs from thoracoscopic images and confirmed that virtual learning could improve the similarity between real and simulated images. Furthermore, shape reconstruction error could be improved by 16.9%.
△ Less
Submitted 25 January, 2023;
originally announced January 2023.
-
The Sound of Bounding-Boxes
Authors:
Takashi Oya,
Shohei Iwase,
Shigeo Morishima
Abstract:
In the task of audio-visual sound source separation, which leverages visual information for sound source separation, identifying objects in an image is a crucial step prior to separating the sound source. However, existing methods that assign sound on detected bounding boxes suffer from a problem that their approach heavily relies on pre-trained object detectors. Specifically, when using these exi…
▽ More
In the task of audio-visual sound source separation, which leverages visual information for sound source separation, identifying objects in an image is a crucial step prior to separating the sound source. However, existing methods that assign sound on detected bounding boxes suffer from a problem that their approach heavily relies on pre-trained object detectors. Specifically, when using these existing methods, it is required to predetermine all the possible categories of objects that can produce sound and use an object detector applicable to all such categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound source simultaneously. As our method does not rely on any pre-trained detector, our method is applicable to arbitrary categories without any additional annotation. Furthermore, although being fully unsupervised, we found that our method performs comparably in separation accuracy.
△ Less
Submitted 29 March, 2022;
originally announced March 2022.
-
Do We Need Sound for Sound Source Localization?
Authors:
Takashi Oya,
Shohei Iwase,
Ryota Natsume,
Takahiro Itazuri,
Shugo Yamaguchi,
Shigeo Morishima
Abstract:
During the performance of sound source localization which uses both visual and aural information, it presently remains unclear how much either image or sound modalities contribute to the result, i.e. do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into…
▽ More
During the performance of sound source localization which uses both visual and aural information, it presently remains unclear how much either image or sound modalities contribute to the result, i.e. do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two steps: (i) "potential sound source localization", a step that localizes possible sound sources using only visual information (ii) "object selection", a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and more importantly, we find that despite the constraint on available information, the results of (i) achieve similar performance. From this observation and further experiments, we show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects within the samples in this dataset can be inherently identified using only visual information, and thus that the dataset is inadequate to evaluate a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces both visual and aural information to be leveraged, and verify this property through several experiments.
△ Less
Submitted 11 July, 2020;
originally announced July 2020.