Search | arXiv e-print repository

Size-Variable Virtual Try-On with Physical Clothes Size

Authors: Yohei Yamashita, Chihiro Nakatani, Norimichi Ukita

Abstract: This paper addresses a new virtual try-on problem of fitting any size of clothes to a reference person in the image domain. While previous image-based virtual try-on methods can produce highly natural try-on images, these methods fit the clothes on the person without considering the relative relationship between the physical sizes of the clothes and the person. Different from these methods, our method achieves size-variable virtual try-on in which the image size of the try-on clothes is changed depending on this relative relationship of the physical sizes. To relieve the difficulty in maintaining the physical size of the clothes while synthesizing the high-fidelity image of the whole clothes, our proposed method focuses on the residual between the silhouettes of the clothes in the reference and try-on images. We also develop a size-variable virtual try-on dataset consisting of 1,524 images provided by 26 subjects. Furthermore, we propose an evaluation metric for size-variable virtual try-on. Quantitative and qualitative experimental results show that our method can achieve size-variable virtual try-on better than general virtual try-on methods.

Submitted 8 December, 2024; originally announced December 2024.

arXiv:2410.20735 [pdf]

doi 10.1038/s41598-025-10912-3

Murine AI excels at cats and cheese: Structural differences between human and mouse neurons and their implementation in generative AIs

Authors: Rino Saiga, Kaede Shiga, Yo Maruta, Chie Inomoto, Hiroshi Kajiwara, Naoya Nakamura, Yu Kakimoto, Yoshiro Yamamoto, Masahiro Yasutake, Masayuki Uesugi, Akihisa Takeuchi, Kentaro Uesugi, Yasuko Terada, Yoshio Suzuki, Viktor Nikitin, Vincent De Andrade, Francesco De Carlo, Yuichi Yamashita, Masanari Itokawa, Soichiro Ide, Kazutaka Ikeda, Ryuta Mizutani

Abstract: Mouse and human brains have different functions that depend on their neuronal networks. In this study, we analyzed nanometer-scale three-dimensional structures of brain tissues of the mouse medial prefrontal cortex and compared them with structures of the human anterior cingulate cortex. The obtained results indicated that mouse neuronal somata are smaller and neurites are thinner than those of human neurons. These structural features allow mouse neurons to be integrated in the limited space of the brain, though thin neurites should suppress distal connections according to cable theory. We implemented this mouse-mimetic constraint in convolutional layers of a generative adversarial network (GAN) and a denoising diffusion implicit model (DDIM), which were then subjected to image generation tasks using photo datasets of cat faces, cheese, human faces, and birds. The mouse-mimetic GAN outperformed a standard GAN in the image generation task using the cat faces and cheese photo datasets, but underperformed for human faces and birds. The mouse-mimetic DDIM gave similar results, suggesting that the nature of the datasets affected the results. Analyses of the four datasets indicated differences in their image entropy, which should influence the number of parameters required for image generation. The preferences of the mouse-mimetic AIs coincided with the impressions commonly associated with mice. The relationship between the neuronal network and brain function should be investigated by implementing other biological findings in artificial neural networks.

Submitted 28 October, 2024; originally announced October 2024.

Comments: 41 pages, 4 figures

Journal ref: Sci. Rep. 15, 25091 (2025)

arXiv:2410.15532 [pdf, ps, other]

Construction and Analysis of Impression Caption Dataset for Environmental Sounds

Authors: Yuki Okamoto, Ryotaro Nagase, Minami Okamoto, Yuki Saito, Keisuke Imoto, Takahiro Fukumori, Yoichi Yamashita

Abstract: Some datasets with the described content and order of occurrence of sounds have been released for conversion between environmental sound and text. However, there are very few texts that include information on the impressions humans feel, such as "sharp" and "gorgeous," when they hear environmental sounds. In this study, we constructed a dataset with impression captions for environmental sounds that describe the impressions humans have when hearing these sounds. We used ChatGPT to generate impression captions and had humans select the most appropriate captions for each sound. Our dataset consists of 3,600 impression captions for environmental sounds. To evaluate the appropriateness of impression captions for environmental sounds, we conducted subjective and objective evaluations. Our evaluation results show that appropriate impression captions for environmental sounds can be generated.

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2306.04143 [pdf, other]

RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction

Authors: Takahiro Fukumori, Taito Ishida, Yoichi Yamashita

Abstract: The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations, such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains a wide variety of types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.

Submitted 19 October, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: This paper has been accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing. DOI: 10.1109/TASLP.2024.3473302

arXiv:2305.00302 [pdf, ps, other]

Environmental sound synthesis from vocal imitations and sound event labels

Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryotaro Nagase, Takahiro Fukumori, Yoichi Yamashita

Abstract: One way of expressing an environmental sound is using vocal imitations, which involve the process of replicating or mimicking the rhythm and pitch of sounds by voice. We can effectively express the features of environmental sounds, such as rhythm and pitch, using vocal imitations, which cannot be expressed by conventional input information, such as sound event labels, images, or texts, in an environmental sound synthesis model. In this paper, we propose a framework for environmental sound synthesis from vocal imitations and sound event labels based on a framework of a vector quantized encoder and the Tacotron2 decoder. Using vocal imitations is expected to control the pitch and rhythm of the synthesized sound, which sound event labels alone cannot control. Our objective and subjective experimental results show that vocal imitations effectively control the pitch and rhythm of synthesized sounds.

Submitted 14 September, 2023; v1 submitted 29 April, 2023; originally announced May 2023.

Comments: Submitted to ICASSP2024

arXiv:2208.07679 [pdf, ps, other]

How Should We Evaluate Synthesized Environmental Sounds

Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Takahiro Fukumori, Yoichi Yamashita

Abstract: Although several methods of environmental sound synthesis have been proposed, there has been no discussion on how synthesized environmental sounds should be evaluated. Only either subjective or objective evaluations have been conducted in conventional evaluations, and it is not clear what type of evaluation should be carried out. In this paper, we investigate how to evaluate synthesized environmental sounds. We also propose a subjective evaluation methodology to evaluate whether the synthesized sound appropriately represents the information input to the environmental sound synthesis system. In our experiments, we compare the proposed and conventional evaluation methods and show that the results of subjective evaluations tended to differ from those of objective evaluations. From these results, we conclude that it is necessary to conduct not only objective evaluation but also subjective evaluation.

Submitted 16 August, 2022; originally announced August 2022.

Comments: Submitted to APSIPA ASC 2022

arXiv:2207.10106 [pdf, ps, other]

World Robot Challenge 2020 -- Partner Robot: A Data-Driven Approach for Room Tidying with Mobile Manipulator

Authors: Tatsuya Matsushima, Yuki Noguchi, Jumpei Arima, Toshiki Aoki, Yuki Okita, Yuya Ikeda, Koki Ishimoto, Shohei Taniguchi, Yuki Yamashita, Shoichi Seto, Shixiang Shane Gu, Yusuke Iwasawa, Yutaka Matsuo

Abstract: Tidying up a household environment using a mobile manipulator poses various challenges in robotics, such as adaptation to large real-world environmental variations, and safe and robust deployment in the presence of humans. The Partner Robot Challenge in World Robot Challenge (WRC) 2020, a global competition held in September 2021, benchmarked tidying tasks in real home environments and, importantly, tested full system performance. For this challenge, we developed an entire household service robot system, which leverages a data-driven approach to adapt to numerous edge cases that occur during the execution, instead of classical manual pre-programmed solutions. In this paper, we describe the core ingredients of the proposed robot system, including visual recognition, object manipulation, and motion planning. Our robot system won the second prize, verifying the effectiveness and potential of data-driven robot systems for mobile manipulation in home environments.

Submitted 21 July, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

arXiv:2111.02666 [pdf, ps, other]

Emergence of sensory attenuation based upon the free-energy principle

Authors: Hayato Idei, Wataru Ohata, Yuichi Yamashita, Tetsuya Ogata, Jun Tani

Abstract: The brain attenuates its responses to self-produced exteroceptions (e.g., we cannot tickle ourselves). Is this phenomenon, known as sensory attenuation, enabled innately, or acquired through learning? Here, our simulation study using a multimodal hierarchical recurrent neural network model, based on variational free-energy minimization, shows that a mechanism for sensory attenuation can develop through learning of two distinct types of sensorimotor experience, involving self-produced or externally produced exteroceptions. For each sensorimotor context, a particular free-energy state emerged through interaction between top-down prediction with precision and bottom-up sensory prediction error from each sensory area. The executive area in the network served as an information hub. Consequently, shifts between the two sensorimotor contexts triggered transitions from one free-energy state to another in the network via executive control, which caused shifts between attenuating and amplifying prediction-error-induced responses in the sensory areas. This study situates emergence of sensory attenuation (or self-other distinction) in development of distinct free-energy states in the dynamic hierarchical neural system.

Submitted 12 August, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2110.11866 [pdf, ps, other]

Morlet wavelet transform using attenuated sliding Fourier transform and kernel integral for graphic processing unit

Authors: Yukihiko Yamashita, Toru Wakahara

Abstract: Morlet or Gabor wavelet transforms, as well as Gaussian smoothing, are widely used in signal processing and image processing. However, the computational complexity of their direct calculations is proportional not only to the number of data points in a signal but also to the smoothing size, which is the standard deviation in the Gaussian function in their transform functions. Thus, when the standard deviation is large, the considerable computation time diminishes the advantages of the aforementioned transforms. Therefore, it is important to formulate an algorithm to reduce the calculation time of the transformations. In this paper, we first review calculation methods of Gaussian smoothing by using the sliding Fourier transform (SFT) and our proposed attenuated SFT (ASFT) \cite{YamashitaICPR2020}. Based on these methods, we propose two types of calculation methods for Morlet wavelet transforms. We also propose an algorithm to calculate SFT using the kernel integral on graphic processing unit (GPU). When the number of calculation cores in GPU is not less than the number of data points, the order of its calculation time is the logarithm of the smoothing size and does not depend on the number of data points. Using experiments, we compare the two methods for calculating the Morlet wavelet transform and evaluate the calculation time of the proposed algorithm using a kernel integral on GPU. For example, when the number of data points and the standard deviation are 102400 and 8192.0, respectively, the calculation time of the Morlet wavelet transform by the proposed method is 0.545 ms, which is 413.6 times faster than a conventional method. (In this version, mistakes in figures are corrected.)

Submitted 24 June, 2024; v1 submitted 3 September, 2021; originally announced October 2021.

Comments: 18 pages

ACM Class: I.5.4

arXiv:2110.03243 [pdf, ps, other]

Sound Event Detection Guided by Semantic Contexts of Scenes

Authors: Noriyuki Tonami, Keisuke Imoto, Ryotaro Nagase, Yuki Okamoto, Takahiro Fukumori, Yoichi Yamashita

Abstract: Some studies have revealed that contexts of scenes (e.g., "home," "office," and "cooking") are advantageous for sound event detection (SED). Mobile devices and sensing technologies give useful information on scenes for SED without the use of acoustic signals. However, conventional methods can employ pre-defined contexts in inference stages but not undefined contexts. This is because one-hot representations of pre-defined scenes are exploited as prior contexts for such conventional methods. To alleviate this problem, we propose scene-informed SED where pre-defined scene-agnostic contexts are available for more accurate SED. In the proposed method, pre-trained large-scale language models are utilized, which enables SED models to employ unseen semantic contexts of scenes in inference stages. Moreover, we investigated the extent to which the semantic representation of scene contexts is useful for SED. Experimental results performed with TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016/2017 datasets show that the proposed method improves micro and macro F-scores by 4.34 and 3.13 percentage points compared with conventional Conformer- and CNN-BiGRU-based SED, respectively.
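The micro and macro F-scores reported above aggregate per-class results differently, and a small sketch may help when comparing the two numbers. The code below uses the standard definitions only; the function names and the example counts are hypothetical and are not taken from the paper.

```python
def f_score(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f(counts):
    """counts: list of (tp, fp, fn) tuples, one per sound event class."""
    # Macro: average the per-class F-scores, so every class weighs the same.
    macro = sum(f_score(tp, fp, fn) for tp, fp, fn in counts) / len(counts)
    # Micro: pool the counts first, so frequent classes dominate.
    total_tp = sum(c[0] for c in counts)
    total_fp = sum(c[1] for c in counts)
    total_fn = sum(c[2] for c in counts)
    micro = f_score(total_tp, total_fp, total_fn)
    return micro, macro


# Hypothetical counts: one frequent, well-detected class and one rare, poorly detected class.
counts = [(90, 10, 10), (1, 4, 5)]
micro, macro = micro_macro_f(counts)
```

With counts like these, the micro score stays high while the macro score drops, which is why the two are usually reported together.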

Submitted 17 February, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Accepted to ICASSP 2022

arXiv:2102.05872 [pdf, ps, other]

Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryosuke Yamanishi, Takahiro Fukumori, Yoichi Yamashita

Abstract: In this paper, we propose a framework for environmental sound synthesis from onomatopoeic words. As one way of expressing an environmental sound, we can use an onomatopoeic word, which is a character sequence for phonetically imitating a sound. An onomatopoeic word is effective for describing diverse sound features. Therefore, using onomatopoeic words for environmental sound synthesis will enable us to generate diverse environmental sounds. To generate diverse sounds, we propose a method based on a sequence-to-sequence framework for synthesizing environmental sounds from onomatopoeic words. We also propose a method of environmental sound synthesis using onomatopoeic words and sound event labels. The use of sound event labels in addition to onomatopoeic words enables us to capture each sound event's feature depending on the input sound event label. Our subjective experiments show that our proposed methods achieve higher diversity and naturalness than conventional methods using sound event labels.

Submitted 7 February, 2022; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: Accepted to APSIPA Transactions on Signal and Information Processing

arXiv:2102.05288 [pdf, ps, other]

Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events

Authors: Noriyuki Tonami, Keisuke Imoto, Yuki Okamoto, Takahiro Fukumori, Yoichi Yamashita

Abstract: In conventional sound event detection (SED) models, two types of events, namely, those that are present and those that do not occur in an acoustic scene, are regarded as the same type of events. The conventional SED methods cannot effectively exploit the difference between the two types of events. All time frames of sound events that do not occur in an acoustic scene are easily regarded as inactive in the scene, that is, the events are easy-to-train. The time frames of the events that are present in a scene must be classified as active in addition to inactive in the acoustic scene, that is, the events are difficult-to-train. To take advantage of the training difficulty, we apply curriculum learning to SED, where models are trained from easy- to difficult-to-train events. To utilize the curriculum learning, we propose a new objective function for SED, wherein the events are trained from easy- to difficult-to-train events. Experimental results show that the F-score of the proposed method is improved by 10.09 percentage points compared with that of the conventional binary cross entropy-based SED.

Submitted 10 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2010.09213 [pdf, ps, other]

doi 10.1587/transinf.2020EDP7036

Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

Authors: Noriyuki Tonami, Keisuke Imoto, Ryosuke Yamanishi, Yoichi Yamashita

Abstract: Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene "office," the sound events "mouse clicking" and "keyboard typing" are likely to occur. Therefore, it is expected that information on sound events and acoustic scenes will be of mutual aid for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information on sound events and acoustic scenes in common are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.

Submitted 16 October, 2020; originally announced October 2020.

Comments: Accepted to IEICE Transactions on Information and Systems. arXiv admin note: text overlap with arXiv:1904.12146

arXiv:2009.10887 [pdf]

doi 10.3389/fnbot.2022.851471

Schizophrenia-mimicking layers outperform conventional neural network layers

Authors: Ryuta Mizutani, Senta Noguchi, Rino Saiga, Yuichi Yamashita, Mitsuhiro Miyashita, Makoto Arai, Masanari Itokawa

Abstract: We have reported nanometer-scale three-dimensional studies of brain networks of schizophrenia cases and found that their neurites are thin and tortuous compared to healthy controls. This suggests that connections between distal neurons are suppressed in microcircuits of schizophrenia cases. In this study, we applied these biological findings to the design of a schizophrenia-mimicking artificial neural network to simulate the observed connection alteration in the disorder. Neural networks having a "schizophrenia connection layer" in place of a fully connected layer were subjected to image classification tasks using the MNIST and CIFAR-10 datasets. The results revealed that the schizophrenia connection layer is tolerant to overfitting and outperforms a fully connected layer. The outperformance was observed only for networks using band matrices as weight windows, indicating that the shape of the weight matrix is relevant to the network performance. A schizophrenia convolution layer was also tested using the VGG configuration, showing that 60% of the kernel weights of the last three convolution layers can be eliminated without loss of accuracy. The schizophrenia layers can be used instead of conventional layers without any change in the network configuration and training procedures; hence, neural networks can easily take advantage of these layers. The results of this study suggest that the connection alteration found in schizophrenia is not a burden to the brain, but has functional roles in brain performance.
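As a rough illustration of what "band matrices as weight windows" could mean in practice, the sketch below masks a fully connected weight matrix so that only entries near the diagonal survive, which suppresses connections between distant units. The abstract does not give the construction, so the mask shape, the `half_width` parameter, and both function names are assumptions made for illustration and not the authors' layer.

```python
def band_mask(n_out, n_in, half_width):
    """Return an n_out x n_in 0/1 mask that is 1 only within half_width of the diagonal."""
    return [[1.0 if abs(i - j) <= half_width else 0.0 for j in range(n_in)]
            for i in range(n_out)]


def masked_linear(x, weights, mask):
    """Compute y = (W * mask) x for one input vector x (element-wise masked weights)."""
    return [sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
            for w_row, m_row in zip(weights, mask)]


# Hypothetical 4x4 layer with all-ones weights: each output sees only its neighbours.
mask = band_mask(4, 4, 1)
weights = [[1.0] * 4 for _ in range(4)]
y = masked_linear([1.0, 2.0, 3.0, 4.0], weights, mask)
```

Because the mask only multiplies the weights, such a layer would drop into a network in place of a dense layer without changing the surrounding configuration, which matches the abstract's remark about unchanged training procedures.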

Submitted 1 April, 2022; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: 16 pages, 6 figures, and 1 table

Journal ref: Frontiers Neurorobot 16, 851471 (2022)

arXiv:2007.04719 [pdf, ps, other]

RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis

Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryosuke Yamanishi, Takahiro Fukumori, Yoichi Yamashita

Abstract: Environmental sound synthesis is a technique for generating a natural environmental sound. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, the pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for explaining the feature of sounds. We believe that using onomatopoeic words will enable us to control the fine time-frequency structure of synthesized sounds. However, there is no dataset available for environmental sound synthesis using onomatopoeic words. In this paper, we thus present RWCP-SSD-Onomatopoeia, a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis. We also collected self-reported confidence scores and others-reported acceptance scores of onomatopoeic words, to help us investigate the difficulty in the transcription and selection of a suitable word for environmental sound synthesis.

Submitted 9 July, 2020; originally announced July 2020.

Comments: Submitted to DCASE2020 workshop

arXiv:2006.15253 [pdf, ps, other]

Sound Event Detection Using Duration Robust Loss Function

Authors: Daichi Akiyama, Keisuke Imoto, Noriyuki Tonami, Yuki Okamoto, Ryosuke Yamanishi, Takahiro Fukumori, Yoichi Yamashita

Abstract: Many methods of sound event detection (SED) based on machine learning regard a segmented time frame as one data sample for model training. However, the sound durations of sound events vary greatly depending on the sound event class, e.g., the sound event "fan" has a long time duration, while the sound event "mouse clicking" is instantaneous. The difference in the time duration between sound event classes thus causes a serious data imbalance problem in SED. In this paper, we propose a method for SED using a duration robust loss function, which can focus model training on sound events of short duration. In the proposed method, we focus on a relationship between the duration of the sound event and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., sound event "fan") are stationary sounds, which have less variation in their acoustic features and their model training is easy. Meanwhile, some sound events of short duration (e.g., sound event "object impact") have more than one audio pattern, such as attack, decay, and release parts. We thus apply a class-wise reweighting to the binary cross-entropy loss function depending on the ease/difficulty of model training. Evaluation experiments conducted using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method respectively improves the detection performance of sound events by 3.15 and 4.37 percentage points in macro- and micro-F-scores compared with a conventional method using the binary cross-entropy loss function.
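The idea of class-wise reweighting of a binary cross-entropy loss can be sketched as follows. The abstract does not state how the weights are derived from duration or training difficulty, so the `class_weights` values here are hypothetical placeholders, and the function is a generic weighted BCE, not the paper's loss.

```python
import math


def weighted_bce(targets, probs, class_weights, eps=1e-7):
    """Mean binary cross-entropy over all frames and classes, one weight per event class.

    targets, probs: lists of frames, each frame a list with one entry per class.
    """
    total = 0.0
    count = 0
    for frame_targets, frame_probs in zip(targets, probs):
        for t, p, w in zip(frame_targets, frame_probs, class_weights):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Two frames, two classes; class 1 stands in for a short, hard-to-train event.
targets = [[1, 0], [1, 1]]
probs = [[0.9, 0.2], [0.8, 0.3]]
plain = weighted_bce(targets, probs, [1.0, 1.0])
reweighted = weighted_bce(targets, probs, [1.0, 2.0])  # hypothetical up-weighting
```

Raising the weight of a class increases its share of the loss, so gradient updates are pulled toward that class's errors.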

Submitted 26 June, 2020; originally announced June 2020.

Comments: Submitted to DCASE2020 Workshop

arXiv:2002.05848 [pdf, ps, other]

Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels

Authors: Keisuke Imoto, Noriyuki Tonami, Yuma Koizumi, Masahiro Yasuda, Ryosuke Yamanishi, Yoichi Yamashita

Abstract: Sound event detection (SED) and acoustic scene classification (ASC) are major tasks in environmental sound analysis. Considering that sound events and scenes are closely related to each other, some works have addressed joint analyses of sound events and acoustic scenes based on multitask learning (MTL), in which the knowledge of sound events and scenes can help in estimating them mutually. The conventional MTL-based methods utilize one-hot scene labels to train the relationship between sound events and scenes; thus, the conventional methods cannot model the extent to which sound events and scenes are related. However, in the real environment, common sound events may occur in some acoustic scenes; on the other hand, some sound events occur only in a limited acoustic scene. In this paper, we thus propose a new method for SED based on MTL of SED and ASC using the soft labels of acoustic scenes, which enable us to model the extent to which sound events and scenes are related. Experiments conducted using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the SED performance by 3.80% in F-score compared with conventional MTL-based SED.

Submitted 13 February, 2020; originally announced February 2020.

Comments: Accepted to ICASSP 2020

arXiv:2001.06988 [pdf, other]

doi 10.1371/journal.pone.0286072

Deep learning generates custom-made logistic regression models for explaining how breast cancer subtypes are classified

Authors: Takuma Shibahara, Chisa Wada, Yasuho Yamashita, Kazuhiro Fujita, Masamichi Sato, Junichi Kuwata, Atsushi Okamoto, Yoshimasa Ono

Abstract: Differentiating the intrinsic subtypes of breast cancer is crucial for deciding the best treatment strategy. Deep learning can predict the subtypes from genetic information more accurately than conventional statistical methods, but to date, deep learning has not been directly utilized to examine which genes are associated with which subtypes. To clarify the mechanisms embedded in the intrinsic subtypes, we developed an explainable deep learning model called a point-wise linear (PWL) model that generates a custom-made logistic regression for each patient. Logistic regression, which is familiar to both physicians and medical informatics researchers, allows us to analyze the importance of the feature variables, and the PWL model harnesses these practical abilities of logistic regression. In this study, we show that analyzing breast cancer subtypes is clinically beneficial for patients and one of the best ways to validate the capability of the PWL model. First, we trained the PWL model with RNA-seq data to predict PAM50 intrinsic subtypes and applied it to the 41/50 genes of PAM50 through the subtype prediction task. Second, we developed a deep enrichment analysis method to reveal the relationships between the PAM50 subtypes and the copy numbers of breast cancer. Our findings showed that the PWL model utilized genes relevant to the cell cycle-related pathways. These preliminary successes in breast cancer subtype analysis demonstrate the potential of our analysis strategy to clarify the mechanisms underlying breast cancer and improve overall clinical outcomes.

Submitted 18 July, 2022; v1 submitted 20 January, 2020; originally announced January 2020.

Comments: 25 pages, 5 figures

arXiv:1908.10055 [pdf, ps, other]

Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion

Authors: Yuki Okamoto, Keisuke Imoto, Tatsuya Komatsu, Shinnosuke Takamichi, Takumi Yagyu, Ryosuke Yamanishi, Yoichi Yamashita

Abstract: Synthesizing and converting environmental sounds have the potential for many applications such as supporting movie and game production, data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addressed environmental sound synthesis and conversion with statistical generative models; thus, this research area is not yet well organized. In this paper, we review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis using sound event labels, in which we focus on the current performance of statistical environmental sound synthesis and investigate how we should conduct subjective experiments on environmental sound synthesis.

Submitted 27 August, 2019; originally announced August 2019.

arXiv:1906.10015 [pdf, other]

doi 10.1016/j.neunet.2019.10.014

A Review on Neural Network Models of Schizophrenia and Autism Spectrum Disorder

Authors: Pablo Lanillos, Daniel Oliva, Anja Philippsen, Yuichi Yamashita, Yukie Nagai, Gordon Cheng

Abstract: This survey presents the most relevant neural network models of autism spectrum disorder and schizophrenia, from the first connectionist models to recent deep network architectures. We analyzed and compared the most representative symptoms with their neural model counterparts, detailing the alteration introduced in the network that generates each of the symptoms, and identifying their strengths and weaknesses. We additionally cross-compared Bayesian and free-energy approaches, as they are widely applied to modeling psychiatric disorders and share basic mechanisms with neural networks. Models of schizophrenia mainly focused on hallucinations and delusional thoughts using neural dysconnections or inhibitory imbalance as the predominating alteration. Models of autism rather focused on perceptual difficulties, mainly excessive attention to environment details, implemented as excessive inhibitory connections or increased sensory precision. We found an excessively tight view of the psychopathologies around one specific and simplified effect, usually constrained to the technical idiosyncrasy of the used network architecture. Recent theories and evidence on sensorimotor integration and body perception combined with modern neural network architectures could offer a broader and novel spectrum to approach these psychopathologies. This review emphasizes the power of artificial neural networks for modeling some symptoms of neurological disorders but also calls for further developing these techniques in the field of computational psychiatry.

Submitted 23 October, 2019; v1 submitted 24 June, 2019; originally announced June 2019.

Comments: Preprint submitted to Neural Networks. Research not referenced in the manuscript within the field of NN models of SZ and ASD are encouraged to contact the corresponding authors

Journal ref: Neural Networks 122 (2020) 338-363

arXiv:1904.12146 [pdf, ps, other]

Joint Analysis of Acoustic Events and Scenes Based on Multitask Learning

Authors: Noriyuki Tonami, Keisuke Imoto, Masahiro Niitsuma, Ryosuke Yamanishi, Yoichi Yamashita

Abstract: Acoustic event detection and scene classification are major research tasks in environmental sound analysis, and many methods based on neural networks have been proposed. Conventional methods have addressed these tasks separately; however, acoustic events and scenes are closely related to each other. For example, in the acoustic scene 'office', the acoustic events 'mouse clicking' and 'keyboard typing' are likely to occur. In this paper, we propose multitask learning for joint analysis of acoustic events and scenes, which shares the parts of the networks holding information on acoustic events and scenes in common. By integrating the two networks, we expect that information on acoustic scenes will improve the performance of acoustic event detection. Experimental results obtained using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of acoustic event detection by 10.66 percentage points in terms of the F-score, compared with a conventional method based on a convolutional recurrent neural network.

Submitted 18 July, 2019; v1 submitted 27 April, 2019; originally announced April 2019.

Comments: Accepted to WASPAA 2019

arXiv:1708.01387 [pdf]

Research Activity Classification based on Time Series Bibliometrics

Authors: Takahiro Kawamura, Yasuhiro Yamashita, Katsuji Matsumura

Abstract: Bibliometrics such as the number of papers and times cited are often used to compare researchers based on specific criteria. The criteria, however, are different in each research domain and are set by empirical laws. Moreover, there are arguments, such that the simple sum of metric values works to the advantage of elders. Therefore, this paper attempts to constitute features from time series data of bibliometrics, and then classify the researchers according to the features. In detail, time series patterns are extracted from bibliographic data sets, and then a model to classify whether the researchers are "distinguished" or not is created by a machine learning technique. The experiments achieved an F-measure of 80.0% in the classification of 114 researchers in two research domains based on the data sets of Japan Science and Technology Agency and Elsevier's Scopus. In the future, we will conduct verification on a number of researchers in several domains, and then make use of discovering "distinguished" researchers, who are not widely known.

Submitted 4 August, 2017; originally announced August 2017.

Journal ref: Proceedings of 21st International Conference on Science and Technology Indicators (STI 2016), pp. 1456-1460 (2016)

arXiv:1605.03754 [pdf, other]

Regression-based Intra-prediction for Image and Video Coding

Authors: Carlo Noel Ochotorena, Yukihiko Yamashita

Abstract: By utilizing previously known areas in an image, intra-prediction techniques can find a good estimate of the current block. This allows the encoder to store only the error between the original block and the generated estimate, thus leading to an improvement in coding efficiency. Standards such as AVC and HEVC describe expert-designed prediction modes operating in certain angular orientations alongside separate DC and planar prediction modes. Being designed predictors, while these techniques have been demonstrated to perform well in image and video coding applications, they do not necessarily fully utilize natural image structures. In this paper, we describe a novel system for developing predictors derived from natural image blocks. The proposed algorithm is seeded with designed predictors (e.g. HEVC-style prediction) and allowed to iteratively refine these predictors through regularized regression. The resulting prediction models show significant improvements in estimation quality over their designed counterparts across all conditions while maintaining reasonable computational complexity. We also demonstrate how the proposed algorithm handles the worst-case scenario of intra-prediction with no error reporting.

Submitted 12 May, 2016; originally announced May 2016.

Showing 1–23 of 23 results for author: Yamashita, Y