
Showing 1–50 of 77 results for author: Yan, B

Searching in archive eess. Search in all archives.
  1. arXiv:2509.14161  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

    Authors: Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe, Chih-Chen Chen, Zhen Wu, Karim Benharrak, Anuj Diwan, Samuele Cornell, Eunjung Yeo, Kwanghee Choi, et al. (2 additional authors not shown)

    Abstract: We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English…

    Submitted 17 September, 2025; originally announced September 2025.

  2. arXiv:2509.03010  [pdf, ps, other]

    cs.CL cs.LG eess.AS

    Mitigating Data Imbalance in Automated Speaking Assessment

    Authors: Fong-Chun Tsai, Kuan-Tang Huang, Bi-Cheng Yan, Tien-Hong Lo, Berlin Chen

    Abstract: Automated Speaking Assessment (ASA) plays a crucial role in evaluating second-language (L2) learners' proficiency. However, ASA models often suffer from class imbalance, leading to biased predictions. To address this, we introduce a novel objective for training ASA models, dubbed the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve feature representation for minorit…

    Submitted 3 September, 2025; originally announced September 2025.

    Comments: Submitted to APSIPA 2025

  3. arXiv:2503.24313  [pdf]

    physics.optics eess.SP

    1-Tb/s/λ Transmission over Record 10714-km AR-HCF

    Authors: Dawei Ge, Siyuan Liu, Qiang Qiu, Peng Li, Qiang Guo, Yiqi Li, Dong Wang, Baoluo Yan, Mingqing Zuo, Lei Zhang, Dechao Zhang, Hu Shi, Jie Luo, Han Li, Zhangyuan Chen

    Abstract: We present the first single-channel 1.001-Tb/s DP-36QAM-PCS recirculating transmission over 73 loops of 146.77-km ultra-low-loss & low-IMI DNANF-5 fiber, achieving a record transmission distance of 10,714.28 km.

    Submitted 2 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

  4. arXiv:2503.14966  [pdf, other]

    cs.CV eess.IV

    Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

    Authors: Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

    Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we intr…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: MICCAI 2024

  5. arXiv:2503.13481  [pdf, other]

    eess.SP physics.class-ph

    An On-Chip Ultra-wideband Antenna with Area-Bandwidth Optimization for Sub-Terahertz Transceivers and Radars

    Authors: Boxun Yan, Runzhou Chen, Mau-Chung Frank Chang

    Abstract: In this paper, we present an on-chip antenna at 290 GHz that achieves a maximum efficiency of 42% on a low-resistivity silicon substrate for sub-terahertz integrated transceivers. The proposed antenna is based on a dual-slot structure to accommodate a limited ground plane and maintain desired radiation and impedance characteristics across the target frequency range. The antenna impedance bandwidt…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: Accepted by the 2025 IEEE International Symposium on Antennas & Propagation (AP-S)

  6. arXiv:2502.10373  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

    Authors: William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

    Abstract: Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: 23 pages, 13 figures

  7. arXiv:2501.00064  [pdf, other]

    cs.SD cs.LG eess.AS

    Lungmix: A Mixup-Based Strategy for Generalization in Respiratory Sound Classification

    Authors: Shijia Ge, Weixiang Zhang, Shuzhao Xie, Baixu Yan, Zhi Wang

    Abstract: Respiratory sound classification plays a pivotal role in diagnosing respiratory diseases. While deep learning models have shown success with various respiratory sound datasets, our experiments indicate that models trained on one dataset often fail to generalize effectively to others, mainly due to data collection and annotation inconsistencies. To address this limitation, we introduce …

    Submitted 29 December, 2024; originally announced January 2025.

    Comments: 4 pages, 3 figures, conference paper

  8. arXiv:2409.18428  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

    Authors: Brian Yan, Vineel Pratap, Shinji Watanabe, Michael Auli

    Abstract: Multilingual Automatic Speech Recognition (ASR) models are typically evaluated in a setting where the ground-truth language of the speech utterance is known; however, this is often not the case in most practical settings. Automatic Spoken Language Identification (SLID) models are not perfect and misclassifications have a substantial impact on the final ASR accuracy. In this paper, we present a si…

    Submitted 26 September, 2024; originally announced September 2024.

  9. arXiv:2409.06468  [pdf]

    cs.CL cs.AI cs.SD eess.AS

    An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

    Authors: Yi-Cheng Wang, Li-Ting Pai, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Berlin Chen

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) models have become standard practice for various commercial applications. However, in real-world scenarios, the long-tailed nature of word distribution often leads E2E ASR models to perform well on common words but fall short in recognizing uncommon ones. Recently, the notion of a contextual adapter (CA) was proposed to infuse external knowledge…

    Submitted 10 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT 2024

  10. arXiv:2407.18465  [pdf]

    cond-mat.mes-hall eess.SY

    Multiphysics Modeling on Photoconductive Antennas for Terahertz Applications

    Authors: Boxun Yan, Bundel Pooja, Chi-Hou Chan, Mau-Chung Frank Chang

    Abstract: Terahertz lies at the juncture between RF and optical electromagnetism, serving as a transition from mm-Wave to infrared photonics. Terahertz technology has been used for industrial quality control, security imaging, and high-speed communications, and is often generated through optoelectronic solutions using photoconductive antennas. In this paper, multiphysics simulations on semi-insulating GaAs,…

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: 3 pages, 4 figures, accepted by 2024 IEEE MTT-S International Conference on Numerical Electromagnetic and Multiphysics Modeling and Optimization (NEMO'2024)

  11. arXiv:2406.02950  [pdf, other]

    eess.AS cs.CL cs.SD

    Joint Beam Search Integrating CTC, Attention, and Transducer Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified by its decoder architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and Mask-CTC models. Each decoder architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application re…

    Submitted 14 January, 2025; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: accepted to IEEE/ACM Transactions on Audio Speech and Language Processing

  12. arXiv:2406.02859   

    eess.AS cs.SD

    ConPCO: Preserving Phoneme Characteristics for Automatic Pronunciation Assessment Leveraging Contrastive Ordinal Regularization

    Authors: Bi-Cheng Yan, Wei-Cheng Chao, Jiun-Ting Li, Yi-Cheng Wang, Hsin-Wei Wang, Meng-Shin Lin, Berlin Chen

    Abstract: Automatic pronunciation assessment (APA) manages to evaluate the pronunciation proficiency of a second language (L2) learner in a target language. Existing efforts typically draw on regression models for proficiency score prediction, where the models are trained to estimate target values without explicitly accounting for phoneme-awareness in the feature space. In this paper, we propose a contrasti…

    Submitted 8 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: This paper has been withdrawn because the authors aim to achieve better organization in writing and more detailed experimental analysis

  13. arXiv:2404.09149  [pdf, other]

    eess.SY cs.NE math.NA

    Heuristic Solution to Joint Deployment and Beamforming Design for STAR-RIS Aided Networks

    Authors: Bai Yan, Qi Zhao, Jin Zhang, J. Andrew Zhang

    Abstract: This paper tackles the deployment challenges of Simultaneous Transmitting and Reflecting Reconfigurable Intelligent Surface (STAR-RIS) in communication systems. Unlike existing works that use fixed deployment setups or solely optimize the location, this paper emphasizes the joint optimization of the location and orientation of STAR-RIS. This enables searching across all user grouping possibilities…

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: 30 pages

  14. arXiv:2403.12695  [pdf, other]

    eess.IV cs.CV cs.LG

    Federated Semi-supervised Learning for Medical Image Segmentation with intra-client and inter-client Consistency

    Authors: Yubin Zheng, Peng Tang, Tianjie Ju, Weidong Qiu, Bo Yan

    Abstract: Medical image segmentation plays a vital role in clinical disease diagnosis and medical image analysis. However, labeling medical images for segmentation tasks is difficult because it requires the domain expertise of radiologists. Furthermore, considering the privacy and sensitivity of medical images, it is impractical to build a centralized segmentation dataset from different medical institutions. Fede…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Work in progress

  15. arXiv:2401.16658  [pdf, ps, other]

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite…

    Submitted 26 August, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  16. arXiv:2311.06079  [pdf]

    cs.CV eess.IV

    Enhancing Rock Image Segmentation in Digital Rock Physics: A Fusion of Generative AI and State-of-the-Art Neural Networks

    Authors: Zhaoyang Ma, Xupeng He, Hyung Kwak, Jun Gao, Shuyu Sun, Bicheng Yan

    Abstract: In digital rock physics, analysing microstructures from CT and SEM scans is crucial for estimating properties like porosity and pore connectivity. Traditional segmentation methods like thresholding and CNNs often fall short in accurately detailing rock microstructures and are prone to noise. U-Net improved segmentation accuracy but required many expert-annotated samples, a laborious and error-pron…

    Submitted 10 November, 2023; originally announced November 2023.

  17. arXiv:2310.01839  [pdf]

    eess.AS cs.CL cs.SD

    Preserving Phonemic Distinctions for Ordinal Regression: A Novel Loss Function for Automatic Pronunciation Assessment

    Authors: Bi-Cheng Yan, Hsin-Wei Wang, Yi-Cheng Wang, Jiun-Ting Li, Chi-Han Lin, Berlin Chen

    Abstract: Automatic pronunciation assessment (APA) manages to quantify the pronunciation proficiency of a second language (L2) learner in a language. Prevailing approaches to APA normally leverage neural models trained with a regression loss function, such as the mean-squared error (MSE) loss, for proficiency level prediction. Although most regression models can effectively capture the ordinality of proficie…

    Submitted 4 October, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU 2023

  18. arXiv:2309.15826  [pdf, other]

    cs.CL cs.SD eess.AS

    Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

    Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

    Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modal…

    Submitted 27 September, 2023; originally announced September 2023.

  19. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  20. arXiv:2309.15686  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

    Authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of…

    Submitted 27 September, 2023; originally announced September 2023.

  21. arXiv:2309.15674  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speech collage: code-switched audio generation by collaging monolingual corpora

    Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We…

    Submitted 27 September, 2023; originally announced September 2023.

  22. arXiv:2309.15317  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

    Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more…

    Submitted 27 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  23. arXiv:2309.13876  [pdf, other]

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessib…

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  24. arXiv:2309.11379  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

    Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

    Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at INTERSPEECH 2023

    Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

  25. arXiv:2309.03520  [pdf, ps, other]

    eess.SY

    Deep Reinforcement Learning Enabled Joint Deployment and Beamforming in STAR-RIS Assisted Networks

    Authors: Zhuoyuan Ma, Qi Zhao, Bai Yan, Jin Zhang

    Abstract: In the new generation of wireless communication systems, reconfigurable intelligent surfaces (RIS) and simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) have become competitive network components to achieve intelligent and reconfigurable network environments. However, existing work has not fully studied the deployment freedom of STAR-RIS, which limits furthe…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: 21 pages, 7 figures

    MSC Class: G.1.6; I.2.8

  26. arXiv:2308.10157  [pdf, ps, other]

    eess.IV cs.CV

    Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

    Authors: Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen

    Abstract: To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is the generative adversarial networks (GANs), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternat…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: Accepted and presented in MICCAI 2023. To be published in Proceedings

  27. arXiv:2307.13643  [pdf, other]

    cs.CR cs.SD eess.AS

    Backdoor Attacks against Voice Recognition Systems: A Survey

    Authors: Baochen Yan, Jiahe Lan, Zheng Yan

    Abstract: Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in various real-world applications, from intelligent voice assistance to telephony surveillance and biometric authentication. However, prior research has revealed the vulnerability of VRSs to backdoor attacks, which pose a significant threat to the security and priva…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: 33 pages, 7 figures

  28. arXiv:2307.11005  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

    Authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between pretrained models, and LM cannot be directly utilized as they diverge from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively int…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted at INTERSPEECH 2023

  29. arXiv:2307.09794  [pdf]

    eess.IV cs.CV physics.med-ph

    DiffDP: Radiotherapy Dose Prediction via a Diffusion Model

    Authors: Zhenghao Feng, Lu Wen, Peng Wang, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang

    Abstract: Currently, deep learning (DL) has achieved the automatic prediction of dose distribution in radiotherapy planning, enhancing its efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L_1 or L_2 loss with posterior average calculations. To alleviate this limitation, we innovatively introduce a diffusion-based dose prediction (DiffDP) model…

    Submitted 19 July, 2023; originally announced July 2023.

    Comments: to be published in MICCAI 2023

  30. arXiv:2306.01247  [pdf, other]

    eess.AS

    Tensor decomposition for minimization of E2E SLU model toward on-device processing

    Authors: Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

    Abstract: Spoken Language Understanding (SLU) is a critical speech recognition application and is often deployed on edge devices. Consequently, on-device processing plays a significant role in the practical implementation of SLU. This paper focuses on the end-to-end (E2E) SLU model due to its small latency property, unlike a cascade system, and aims to minimize the computational cost. We reduce the model si…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  31. arXiv:2305.18108  [pdf, other]

    cs.SD eess.AS

    Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

    Authors: Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods employ the output of intermediate layers of the SSL model as real-valued features for downstream tasks, there is potential in exploring alternative approaches that use discretized token sequences. This approach offers benefits such as…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023

  32. arXiv:2305.11095  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

    Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

    Abstract: We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts, by either leveraging another large-scale model, or sim…

    Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  33. arXiv:2305.11073  [pdf, other]

    cs.CL cs.SD eess.AS

    A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

    Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

    Abstract: Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech ASR benchmark, making it…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023. Code: https://github.com/espnet/espnet

  34. arXiv:2305.01620  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge

    Authors: Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: Recently there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), like semantic parsing. In this paper, we describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge which is part of ICASSP Signal Processing Grand Challenge 2023. We experiment with both end-to-end and pipeline system…

    Submitted 6 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: First Place in Track 1 of STOP Challenge, which is part of ICASSP Signal Processing Grand Challenge 2023

  35. arXiv:2305.01194  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

    Authors: Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

    Abstract: This paper describes our system for the low-resource domain adaptation track (Track 3) in Spoken Language Understanding Grand Challenge, which is a part of ICASSP Signal Processing Grand Challenge 2023. In the track, we adopt a pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain with upsampling. For NLU, we fine-tune BART on all the Track3 data and then on low-resource…

    Submitted 11 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: To appear at ICASSP 2023

  36. arXiv:2305.00926  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

    Authors: Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: Most human interactions occur in the form of spoken conversations where the semantic meaning of a given utterance depends on the context. Each utterance in spoken conversation can be represented by many semantic and speaker attributes, and there has been an interest in building Spoken Language Understanding (SLU) systems for automatically predicting these attributes. Recent work has shown that inc…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  37. arXiv:2304.13583  [pdf, other]

    eess.IV cs.CV

    Multi-Modality Deep Network for Extreme Learned Image Compression

    Authors: Xuhao Jiang, Weimin Tan, Tian Tan, Bo Yan, Liquan Shen

    Abstract: Image-based single-modality compression learning approaches have demonstrated exceptionally powerful encoding and decoding capabilities in the past few years, but suffer from blur and severe semantics loss at extremely low bitrates. To address this issue, we propose a multimodal machine learning method for text-guided image compression, in which the semantic information of text is used as prior i…

    Submitted 26 April, 2023; originally announced April 2023.

    Comments: 13 pages, 14 figures, accepted by AAAI 2023

  38. arXiv:2304.07990  [pdf, other]

    eess.SY

    Novel Quality Measure and Efficient Resolution of Convex Hull Pricing for Unit Commitment

    Authors: Mikhail A. Bragin, Farhan Hyder, Bing Yan, Peter B. Luh, Jinye Zhao, Feng Zhao, Dane A. Schiro, Tongxin Zheng

    Abstract: Electricity prices determined by economic dispatch that do not consider fixed costs may lead to significant uplift payments. However, when fixed costs are included, prices become non-monotonic with respect to demand, which can adversely impact market transparency. To overcome this issue, convex hull (CH) pricing has been introduced for unit commitment with fixed costs. Several CH pricing methods h…

    Submitted 17 April, 2023; originally announced April 2023.

  39. arXiv:2304.04596  [pdf, other]

    cs.SD cs.CL eess.AS

    ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

    Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

    Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-…

    Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: ACL 2023; System Demonstration

  40. arXiv:2302.12829  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Massively Multilingual ASR With Auxiliary CTC Objectives

    Authors: William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages. With how many languages these models have to handle, however, a key to understanding their imbalanced performance across different languages is to examine if the model actually knows which language it should transcribe. In this paper, we introduce our work on im…

    Submitted 27 February, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

    Comments: 5 pages, 1 figure, accepted at ICASSP 2023; fixed typo and URL in abstract

  41. arXiv:2212.10818  [pdf, other]

    cs.SD cs.CL eess.AS

    4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

    Authors: Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

    Abstract: The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models d…

    Submitted 29 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: Accepted by INTERSPEECH 2023

  42. arXiv:2211.08989  [pdf, other]

    cs.CL cs.SD eess.AS

    Avoid Overthinking in Self-Supervised Models for Speech Recognition

    Authors: Dan Berrebbi, Brian Yan, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) models reshaped our approach to speech, language and vision. However, their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically red…

    Submitted 1 November, 2022; originally announced November 2022.

  43. arXiv:2211.05967  [pdf, ps, other]

    cs.CL eess.AS

    Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

    Authors: Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

    Abstract: The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises f…

    Submitted 10 November, 2022; originally announced November 2022.

  44. arXiv:2211.01458  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Zero-Shot Code-Switched Speech Recognition

    Authors: Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

    Abstract: In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, th…

    Submitted 9 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: 5 pages

  45. arXiv:2210.16663  [pdf, other]

    eess.AS cs.CL

    BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

    Authors: Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the…

    Submitted 19 April, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: v1: Accepted to Findings of EMNLP2022, v2: Minor corrections and clearer derivation of Eq. (21)

  46. arXiv:2210.15734  [pdf, other]

    cs.CL cs.SD eess.AS

    Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

    Authors: Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task causing a divergence from its well-established token-level tagging formulation. We build compositional end-to-end SLU systems that explicitly separate the a…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted at EMNLP 2022 Findings. Our code and models will be publicly available as part of the ESPnet-SLU toolkit: https://github.com/espnet/espnet and the release can be followed here: https://github.com/espnet/espnet/pull/4735

  47. arXiv:2210.07499  [pdf, other]

    cs.CL cs.SD eess.AS

    Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

    Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target…

    Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  48. arXiv:2210.05200  [pdf, other]

    cs.CL cs.SD eess.AS

    CTC Alignments Improve Autoregressive Translation

    Authors: Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However, for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CT…

    Submitted 11 October, 2022; originally announced October 2022.

  49. arXiv:2207.09514  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

    Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

    Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  50. arXiv:2207.06670  [pdf, other]

    cs.CL cs.SD eess.AS

    Two-Pass Low Latency End-to-End Spoken Language Understanding

    Authors: Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

    Abstract: End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve competitive performance to pipeline-based approaches. However, recent work has shown that these models struggle to generalize to new phrasings for the same intent indicating that models cannot understand the semantic content of the given utterance. In this work, we…

    Submitted 29 July, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: INTERSPEECH 2022