Showing 1–8 of 8 results for author: Sawata, R

  1. arXiv:2502.12080  [pdf, ps, other]

    cs.CV

    HumanGif: Single-View Human Diffusion with Generative Prior

    Authors: Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perceptually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by t…

    Submitted 29 June, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: Project page: https://skhu101.github.io/HumanGif/

  2. arXiv:2406.07532  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Hearing Anything Anywhere

    Authors: Mason Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu

    Abstract: Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: CVPR 2024. The first two authors contributed equally. Project page: https://masonlwang.com/hearinganythinganywhere/

    ACM Class: I.2.10; I.4.8

  3. The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Networks

    Authors: Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the…

    Submitted 5 August, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: Accepted by EURASIP Journal on Audio, Speech, and Music Processing (under CC BY)

    Journal ref: EURASIP Journal on Audio, Speech, and Music Processing (JASM), 39 (2024)

  4. Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

    Authors: Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a datase…

    Submitted 30 August, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  5. arXiv:2210.05148  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

    Authors: Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji

    Abstract: In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic looking piano rolls from pure Gaussian noise conditioned on spectrograms.…

    Submitted 20 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Journal ref: Proceedings of ICASSP - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023

  6. arXiv:2110.05968  [pdf, ps, other]

    eess.AS cs.AI

    Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

    Authors: Ryosuke Sawata, Yosuke Kashiwagi, Shusuke Takahashi

    Abstract: A deep neural network (DNN)-based speech enhancement (SE) aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed in this paper. In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metrics used to evaluate the ASR system and is generally non-differentiable, our method uses two DNNs: one for speech processing and…

    Submitted 22 February, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  7. Manifold-Aware Deep Clustering: Maximizing Angles between Embedding Vectors Based on Regular Simplex

    Authors: Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi

    Abstract: This paper presents a new deep clustering (DC) method called manifold-aware DC (M-DC) that can enhance hyperspace utilization more effectively than the original DC. The original DC has a limitation in that a pair of two speakers has to be embedded having an orthogonal relationship due to its use of the one-hot vector-based loss function, while our method derives a unique loss function aimed at max…

    Submitted 16 October, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted by Interspeech 2021

  8. arXiv:2010.04228  [pdf, ps, other]

    eess.AS cs.SD

    All for One and One for All: Improving Music Separation by Bridging Networks

    Authors: Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes. First, by using MDL we take advantage of the frequency and time domain representation of audio signals. Next, we utilize the relationship among instruments by jointly considering them. We do this on the one hand by modifying the network archi…

    Submitted 11 May, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

    Comments: Both implementations of our code, i.e., NNabla and PyTorch, are available in this latest paper