Showing 1–50 of 254 results for author: Shi, J

  1. arXiv:2507.05666  [pdf, ps, other]

    cs.CV eess.IV

    Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain

    Authors: Junfei Shi, Yu Cheng, Haiyan Jin, Junhuai Li, Zhaolin Xiao, Maoguo Gong, Weisi Lin

    Abstract: Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitation… ▽ More

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.04048  [pdf, ps, other]

    cs.SD eess.AS

    CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning

    Authors: Jiacheng Shi, Yanfu Zhang, Ye Gao

    Abstract: Speech Emotion Recognition (SER) is fundamental to affective computing and human-computer interaction, yet existing models struggle to generalize across diverse acoustic conditions. While Contrastive Language-Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks dedicated mechanisms for capturing emotional cues, making it suboptimal for SER. To address this, we propose CLEP-DG, a… ▽ More

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2507.01173  [pdf]

    eess.SY

    An Adaptive Estimation Approach based on Fisher Information to Overcome the Challenges of LFP Battery SOC Estimation

    Authors: Junzhe Shi, Shida Jiang, Shengyu Tao, Jaewong Lee, Manashita Borah, Scott Moura

    Abstract: Robust and Real-time State of Charge (SOC) estimation is essential for Lithium Iron Phosphate (LFP) batteries, which are widely used in electric vehicles (EVs) and energy storage systems due to safety and longevity. However, the flat Open Circuit Voltage (OCV)-SOC curve makes this task particularly challenging. This challenge is complicated by hysteresis effects, and real-world conditions such as… ▽ More

    Submitted 1 July, 2025; originally announced July 2025.

  4. arXiv:2506.17611  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    OpusLM: A Family of Open Unified Speech Language Models

    Authors: Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe

    Abstract: This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance with existing SpeechLMs in… ▽ More

    Submitted 21 June, 2025; originally announced June 2025.

  5. arXiv:2506.17027  [pdf, ps, other]

    cs.CV eess.IV

    Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns

    Authors: Yiyang Tie, Hong Zhu, Yunyun Luo, Jing Shi

    Abstract: The training of real-world super-resolution reconstruction models heavily relies on datasets that reflect real-world degradation patterns. Extracting and modeling degradation patterns for super-resolution reconstruction using only real-world low-resolution (LR) images remains a challenging task. When synthesizing datasets to simulate real-world degradation, relying solely on degradation extraction… ▽ More

    Submitted 20 June, 2025; originally announced June 2025.

  6. arXiv:2506.12260  [pdf, ps, other]

    cs.SD eess.AS

    Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment

    Authors: Wei Wang, Wangyou Zhang, Chenda Li, Jiatong Shi, Shinji Watanabe, Yanmin Qian

    Abstract: Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted signal components. While SQA models are widely used to evaluate SE performance, their potential to guide SE training remains underexplored. In this work, we invest… ▽ More

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Submitted to ASRU 2025

  7. arXiv:2506.10274  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    Discrete Audio Tokens: More Than a Survey!

    Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

    Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs).… ▽ More

    Submitted 16 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  8. arXiv:2506.00722  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

    Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-th… ▽ More

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted at INTERSPEECH 2025

  9. arXiv:2505.24518  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

    Authors: Jiatong Shi, Yifan Cheng, Bo-Hao Su, Hye-jin Shim, Jinchuan Tian, Samuele Cornell, Yiwen Zhao, Siddhant Arora, Shinji Watanabe

    Abstract: Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. Howev… ▽ More

    Submitted 30 May, 2025; originally announced May 2025.

  10. arXiv:2505.20741  [pdf, ps, other]

    cs.SD eess.AS

    Uni-VERSA: Versatile Speech Assessment with a Unified Network

    Authors: Jiatong Shi, Hye-Jin Shim, Shinji Watanabe

    Abstract: Subjective listening tests remain the golden standard for speech quality assessment, but are costly, variable, and difficult to scale. In contrast, existing objective metrics, such as PESQ, F0 correlation, and DNSMOS, typically capture only specific aspects of speech quality. To address these limitations, we introduce Uni-VERSA, a unified network that simultaneously predicts various objective metr… ▽ More

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech

  11. arXiv:2505.19119  [pdf, other]

    cs.SD cs.AI eess.AS

    CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning

    Authors: Renyuan Li, Zhibo Liang, Haichuan Zhang, Tianyu Shi, Zhiyuan Cheng, Jia Shi, Carl Yang, Mingjie Tang

    Abstract: Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns, allowing highly accurate vocal identity replication from just a few seconds of reference audio, while retaining the speaker's vocal authenticity. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice c… ▽ More

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 10 pages, 4 figures

  12. arXiv:2505.15772  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

    Authors: Yifan Cheng, Ruoyi Zhang, Jiatong Shi

    Abstract: Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM… ▽ More

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech

  13. arXiv:2504.13131  [pdf, other]

    eess.IV cs.AI cs.CV

    NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

    Authors: Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Yabin Zhang, Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, Yiting Liao, Yushen Zuo, Mingyang Wu, Renjie Li, Shengyun Zhong , et al. (88 additional authors not shown)

    Abstract: This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating re… ▽ More

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of NTIRE 2025; Methods from 18 Teams; Accepted by CVPR Workshop; 21 pages

  14. arXiv:2504.04182  [pdf, other]

    eess.SY

    Model Predictive Building Climate Control for Mitigating Heat Pump Noise Pollution (Extended Version)

    Authors: Yun Li, Jicheng Shi, Colin N. Jones, Neil Yorke-Smith, Tamas Keviczky

    Abstract: Noise pollution from heat pumps (HPs) has been an emerging concern to their broader adoption, especially in densely populated areas. This paper explores a model predictive control (MPC) approach for building climate control, aimed at minimizing the noise nuisance generated by HPs. By exploiting a piecewise linear approximation of HP noise patterns and assuming linear building thermal dynamics, the… ▽ More

    Submitted 5 April, 2025; originally announced April 2025.

    Comments: 7 pages, accepted to ECC2025

  15. arXiv:2504.02007  [pdf, other]

    eess.IV

    OccludeNeRF: Geometric-aware 3D Scene Inpainting with Collaborative Score Distillation in NeRF

    Authors: Jingyu Shi, Achleshwar Luthra, Jiazhi Li, Xiang Gao, Xiyun Song, Zongfang Lin, David Gu, Heather Yu

    Abstract: With Neural Radiance Fields (NeRFs) arising as a powerful 3D representation, research has investigated its various downstream tasks, including inpainting NeRFs with 2D images. Despite successful efforts addressing the view consistency and geometry quality, prior methods yet suffer from occlusion in NeRF inpainting tasks, where 2D prior is severely limited in forming a faithful reconstruction of th… ▽ More

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: CVPR 2025 CV4Metaverse

  16. arXiv:2503.24169  [pdf, other]

    eess.SY

    Disturbance-adaptive Model Predictive Control for Bounded Average Constraint Violations

    Authors: Jicheng Shi, Colin N. Jones

    Abstract: This paper considers stochastic linear time-invariant systems subject to constraints on the average number of state-constraint violations over time without knowing the disturbance distribution. We present a novel disturbance-adaptive model predictive control (DAD-MPC) framework, which adjusts the disturbance model based on measured constraint violations. Using a robust invariance method, DAD-MPC e… ▽ More

    Submitted 7 May, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

  17. arXiv:2503.19143  [pdf, other]

    eess.SP

    Joint Sparse Graph for Enhanced MIMO-AFDM Receiver Design

    Authors: Qu Luo, Jing Zhu, Zilong Liu, Yanqun Tang, Pei Xiao, Gaojie Chen, Jia Shi

    Abstract: Affine frequency division multiplexing (AFDM) is a promising chirp-assisted multicarrier waveform for future high-mobility communications. This paper is devoted to enhanced receiver design for multiple input and multiple output AFDM (MIMO-AFDM) systems. Firstly, we introduce a unified variational inference (VI) approach to approximate the target posterior distribution, under which the belief propa… ▽ More

    Submitted 24 March, 2025; originally announced March 2025.

  18. arXiv:2503.16669  [pdf, other]

    cs.SD cs.AI eess.AS

    Aligning Text-to-Music Evaluation with Human Preferences

    Authors: Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue

    Abstract: Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to pa… ▽ More

    Submitted 20 March, 2025; originally announced March 2025.

  19. arXiv:2503.08533  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems

    Authors: Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe

    Abstract: Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo furthe… ▽ More

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted at NAACL 2025 Demo Track

  20. arXiv:2503.06686  [pdf, other]

    eess.IV cs.CV

    ImplicitCell: Resolution Cell Modeling of Joint Implicit Volume Reconstruction and Pose Refinement in Freehand 3D Ultrasound

    Authors: Sheng Song, Yiting Chen, Duo Xu, Songhan Ge, Yunqian Huang, Junni Shi, Man Chen, Hongbo Chen, Rui Zheng

    Abstract: Freehand 3D ultrasound enables volumetric imaging by tracking a conventional ultrasound probe during freehand scanning, offering enriched spatial information that improves clinical diagnosis. However, the quality of reconstructed volumes is often compromised by tracking system noise and irregular probe movements, leading to artifacts in the final reconstruction. To address these challenges, we pro… ▽ More

    Submitted 9 March, 2025; originally announced March 2025.

  21. arXiv:2502.16897  [pdf, other]

    eess.AS

    Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM

    Authors: Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu

    Abstract: Recent efforts have extended textual LLMs to the speech domain. Yet, a key challenge remains, which is balancing speech understanding and generation while avoiding catastrophic forgetting when integrating acoustically rich codec-based representations into models originally trained on text. In this work, we propose a novel approach that leverages continual pre-training (CPT) on a pre-trained textua… ▽ More

    Submitted 24 February, 2025; originally announced February 2025.

  22. arXiv:2502.15218  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SpeechLM: An Open Speech Language Model Toolkit

    Authors: Jinchuan Tian, Jiatong Shi, William Chen, Siddhant Arora, Yoshiki Masuyama, Takashi Maekaku, Yihan Wu, Junyi Peng, Shikhar Bharadwaj, Yiwen Zhao, Samuele Cornell, Yifan Peng, Xiang Yue, Chao-Han Huck Yang, Graham Neubig, Shinji Watanabe

    Abstract: We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users c… ▽ More

    Submitted 24 February, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

  23. arXiv:2502.11729  [pdf, other]

    eess.IV

    On Quantizing Neural Representation for Variable-Rate Video Coding

    Authors: Junqi Shi, Zhujia Chen, Hanfei Li, Qi Zhao, Ming Lu, Tong Chen, Zhan Ma

    Abstract: This work introduces NeuroQuant, a novel post-training quantization (PTQ) approach tailored to non-generalized Implicit Neural Representations for variable-rate Video Coding (INR-VC). Unlike existing methods that require extensive weight retraining for each target bitrate, we hypothesize that variable-rate coding can be achieved by adjusting quantization parameters (QPs) of pre-trained weights. Ou… ▽ More

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: to be published in ICLR'25

  24. arXiv:2502.05445  [pdf, other]

    eess.IV cs.CV

    Unsupervised Self-Prior Embedding Neural Representation for Iterative Sparse-View CT Reconstruction

    Authors: Xuanyu Tian, Lixuan Chen, Qing Wu, Chenhe Du, Jingjing Shi, Hongjiang Wei, Yuyao Zhang

    Abstract: Emerging unsupervised implicit neural representation (INR) methods, such as NeRP, NeAT, and SCOPE, have shown great potential to address sparse-view computed tomography (SVCT) inverse problems. Although these INR-based methods perform well in relatively dense SVCT reconstructions, they struggle to achieve comparable performance to supervised methods in sparser SVCT scenarios. They are prone to bei… ▽ More

    Submitted 7 February, 2025; originally announced February 2025.

    Journal ref: AAAI 2025

  25. arXiv:2501.10859  [pdf, other]

    eess.SY cs.LG math.OC

    Which price to pay? Auto-tuning building MPC controller for optimal economic cost

    Authors: Jiarui Yu, Jicheng Shi, Wenjie Xu, Colin N. Jones

    Abstract: Model predictive control (MPC) controller is considered for temperature management in buildings but its performance heavily depends on hyperparameters. Consequently, MPC necessitates meticulous hyperparameter tuning to attain optimal performance under diverse contracts. However, conventional building controller design is an open-loop process without critical hyperparameter optimization, often lead… ▽ More

    Submitted 18 January, 2025; originally announced January 2025.

    Comments: 15 pages, 9 figures

  26. arXiv:2501.03737  [pdf, other]

    eess.IV cs.CV

    Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction

    Authors: Hao Zhang, Qi Wang, Jian Sun, Zhijie Wen, Jun Shi, Shihui Ying

    Abstract: Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but suffers from prolonged acquisition time. Although deep learning methods have been proposed to accelerate acquisition and demonstrate promising performance, they rely on high-quality fully-sampled datasets for training in a supervised manner. However, such datasets are time-consuming and expensive-to-collect, which constrain… ▽ More

    Submitted 7 January, 2025; originally announced January 2025.

  27. arXiv:2412.17667  [pdf, other]

    cs.SD cs.MM eess.AS

    VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

    Authors: Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe

    Abstract: In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompas… ▽ More

    Submitted 26 March, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

  28. Technical Report: Towards Spatial Feature Regularization in Deep-Learning-Based Array-SAR Reconstruction

    Authors: Yu Ren, Xu Zhan, Yunqiao Hu, Xiangdong Ma, Liang Liu, Mou Wang, Jun Shi, Shunjun Wei, Tianjiao Zeng, Xiaoling Zhang

    Abstract: Array synthetic aperture radar (Array-SAR), also known as tomographic SAR (TomoSAR), has demonstrated significant potential for high-quality 3D mapping, particularly in urban areas. While deep learning (DL) methods have recently shown strengths in reconstruction, most studies rely on pixel-by-pixel reconstruction, neglecting spatial features like building structures, leading to artifacts such as ho… ▽ More

    Submitted 21 December, 2024; originally announced December 2024.

  29. arXiv:2412.13822  [pdf, ps, other]

    eess.SP

    A Robust Anti-noise Scheme for RF Fingerprint Identification

    Authors: Junxian Shi, Linning Peng, Wentao Jing, Lingnan Xie, Haichuan Peng, Aiqun Hu

    Abstract: Radio frequency (RF) fingerprint technology is utilized for wireless device identification, extensively employed in the internet of things (IoT). The operating environment for IoT devices is challenging, with pervasive noise and distortion on the signals which blur the feature space of RF fingerprints. Consequently, the model accuracy obtained through training at high signal-to-noise ratio (SNR) s… ▽ More

    Submitted 18 December, 2024; originally announced December 2024.

    Journal ref: INCC2024

  30. arXiv:2412.12126  [pdf]

    cs.DC cs.CV cs.LG eess.IV eess.SP

    Seamless Optical Cloud Computing across Edge-Metro Network for Generative AI

    Authors: Sizhe Xing, Aolong Sun, Chengxi Wang, Yizhi Wang, Boyu Dong, Junhui Hu, Xuyu Deng, An Yan, Yingjun Liu, Fangchen Hu, Zhongya Li, Ouhan Huang, Junhao Zhao, Yingjun Zhou, Ziwei Li, Jianyang Shi, Xi Xiao, Richard Penty, Qixiang Cheng, Nan Chi, Junwen Zhang

    Abstract: The rapid advancement of generative artificial intelligence (AI) in recent years has profoundly reshaped modern lifestyles, necessitating a revolutionary architecture to support the growing demands for computational power. Cloud computing has become the driving force behind this transformation. However, it consumes significant power and faces computation security risks due to the reliance on exten… ▽ More

    Submitted 1 May, 2025; v1 submitted 4 December, 2024; originally announced December 2024.

  31. arXiv:2412.09238  [pdf, ps, other]

    eess.SY

    Disturbance-Adaptive Data-Driven Predictive Control: Trading Comfort Violations for Savings in Building Climate Control

    Authors: Jicheng Shi, Christophe Salzmann, Colin N. Jones

    Abstract: Model Predictive Control (MPC) has demonstrated significant potential in improving energy efficiency in building climate control, outperforming traditional controllers commonly used in modern building management systems. Among MPC variants, Data-driven Predictive Control (DPC) offers the advantage of modeling building dynamics directly from data, thereby substantially reducing commissioning effort… ▽ More

    Submitted 1 July, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

  32. arXiv:2411.18217  [pdf, other]

    cs.SD cs.CL eess.AS

    How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

    Authors: Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee

    Abstract: The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors… ▽ More

    Submitted 5 January, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

  33. arXiv:2411.18107  [pdf, other]

    cs.SD eess.AS

    Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

    Authors: Shih-heng Wang, Jiatong Shi, Chien-yu Huang, Shinji Watanabe, Hung-yi Lee

    Abstract: Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage demands. On the other hand, discrete SSL representations, although with degraded performance, reduce transmission and storage costs, and improve input sequence efficiency through de-duplication… ▽ More

    Submitted 27 November, 2024; originally announced November 2024.

    Comments: SLT 2024

  34. arXiv:2411.17705  [pdf, other]

    eess.SP cs.AI cs.LG

    EEG-DCNet: A Fast and Accurate MI-EEG Dilated CNN Classification Method

    Authors: Wei Peng, Kang Liu, Jiaxi Shi, Jianchen Hu

    Abstract: The electroencephalography (EEG)-based motor imagery (MI) classification is a critical and challenging task in brain-computer interface (BCI) technology, which plays a significant role in assisting patients with functional impairments to regain mobility. We present a novel multi-scale atrous convolutional neural network (CNN) model called EEG-dilated convolution network (DCNet) to enhance the accu… ▽ More

    Submitted 12 November, 2024; originally announced November 2024.

  35. arXiv:2411.05361  [pdf, ps, other]

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang , et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati… ▽ More

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  36. arXiv:2410.12885  [pdf, other]

    eess.AS cs.CL q-bio.QM

    Exploiting Longitudinal Speech Sessions via Voice Assistant Systems for Early Detection of Cognitive Decline

    Authors: Kristin Qi, Jiatong Shi, Caroline Summerour, John A. Batsis, Xiaohui Liang

    Abstract: Mild Cognitive Impairment (MCI) is an early stage of Alzheimer's disease (AD), a form of neurodegenerative disorder. Early identification of MCI is crucial for delaying its progression through timely interventions. Existing research has demonstrated the feasibility of detecting MCI using speech collected from clinical interviews or digital devices. However, these approaches typically analyze data… ▽ More

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: IEEE International Conference on E-health Networking, Application & Services

  37. HASN: Hybrid Attention Separable Network for Efficient Image Super-resolution

    Authors: Weifeng Cao, Xiaoyan Lei, Jun Shi, Wanyong Liang, Jie Liu, Zongfei Bai

    Abstract: Recently, lightweight methods for single image super-resolution (SISR) have gained significant popularity and achieved impressive performance due to limited hardware resources. These methods demonstrate that adopting residual feature distillation is an effective way to enhance performance. However, we find that using residual connections after each block increases the model's storage and computati… ▽ More

    Submitted 13 October, 2024; originally announced October 2024.

    Comments: Accepted by Visual Computer

  38. arXiv:2410.07572  [pdf]

    physics.optics eess.SP

    Edge-guided inverse design of digital metamaterial-based mode multiplexers for high-capacity multi-dimensional interconnect

    Authors: Aolong Sun, Sizhe Xing, Xuyu Deng, Ruoyu Shen, An Yan, Fangchen Hu, Yuqin Yuan, Boyu Dong, Junhao Zhao, Ouhan Huang, Ziwei Li, Jianyang Shi, Yingjun Zhou, Chao Shen, Yiheng Zhao, Bingzhou Hong, Wei Chu, Junwen Zhang, Haiwen Cai, Nan Chi

    Abstract: The escalating demands of compute-intensive applications urgently necessitate the adoption of optical interconnect technologies to overcome bottlenecks in scaling computing systems. This requires fully exploiting the inherent parallelism of light across scalable dimensions for data loading. Here we experimentally demonstrate a synergy of wavelength- and mode- multiplexing combined with high-order… ▽ More

    Submitted 26 February, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

  39. arXiv:2409.15897  [pdf, ps, other]

    eess.AS cs.SD

    ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

    Authors: Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharhi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe

    Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse appli… ▽ More

    Submitted 24 February, 2025; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT

  40. arXiv:2409.09506  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

    Authors: Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe

    Abstract: We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, a… ▽ More

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted to SLT 2024

  41. arXiv:2409.07226  [pdf, other]

    cs.SD eess.AS

    Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

    Authors: Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin

    Abstract: This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format in…

    Submitted 10 October, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: Accepted by ACMMM 2024 demo track

  42. arXiv:2409.05155  [pdf, other]

    math.OC eess.SY

    Difference Between Cyclic and Distributed Approach in Stochastic Optimization for Multi-agent System

    Authors: Jiahao Shi, James C. Spall

    Abstract: Many stochastic optimization problems in multi-agent systems can be decomposed into smaller subproblems or reduced decision subspaces. The cyclic and distributed approaches are two widely used strategies for solving such problems. In this manuscript, we review four existing methods for addressing these problems and compare them based on their suitable problem frameworks and update rules.

    Submitted 8 September, 2024; originally announced September 2024.

  43. arXiv:2408.16132  [pdf, other]

    eess.AS cs.MM cs.SD

    SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

    Abstract: With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD trac…

    Submitted 23 September, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

    Comments: 6 pages, Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT 2024)

  44. arXiv:2408.14262  [pdf]

    cs.CL cs.SD eess.AS

    Self-supervised Speech Representations Still Struggle with African American Vernacular English

    Authors: Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen

    Abstract: Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American Eng…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: INTERSPEECH 2024

  45. arXiv:2408.13495  [pdf]

    eess.IV cs.CV

    Topological GCN for Improving Detection of Hip Landmarks from B-Mode Ultrasound Images

    Authors: Tianxiang Huang, Jing Shi, Ge Jin, Juncheng Li, Jun Wang, Jun Du, Jun Shi

    Abstract: B-mode ultrasound based computer-aided diagnosis (CAD) has demonstrated its effectiveness for the diagnosis of Developmental Dysplasia of the Hip (DDH) in infants. However, due to the effect of speckle noise in ultrasound images, it is still a challenging task to accurately detect hip landmarks. In this work, we propose a novel hip landmark detection model by integrating the Topological GCN (TGCN) with…

    Submitted 24 August, 2024; originally announced August 2024.

  46. arXiv:2408.01127  [pdf, ps, other]

    eess.SY

    Relax, Estimate, and Track: a Simple Battery State-of-charge and State-of-health Estimation Method

    Authors: Shida Jiang, Junzhe Shi, Scott Moura

    Abstract: Battery management is a critical component of ubiquitous battery-powered energy systems, in which battery state-of-charge (SOC) and state-of-health (SOH) estimations are of crucial importance. Conventional SOC and SOH estimation methods, especially model-based methods, often lack accurate modeling of the open circuit voltage (OCV), have relatively high computational complexity, and lack theoretica…

    Submitted 6 June, 2025; v1 submitted 2 August, 2024; originally announced August 2024.

    Comments: Minor changes to texts. The codes and dataset are now attached

  47. arXiv:2407.21395  [pdf, other]

    eess.IV

    HINER: Neural Representation for Hyperspectral Image

    Authors: Junqi Shi, Mingyi Jiang, Ming Lu, Tong Chen, Xun Cao, Zhan Ma

    Abstract: This paper introduces HINER, a novel neural representation for compressing HSI and ensuring high-quality downstream tasks on compressed HSI. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angl…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: ACM MM24

  48. arXiv:2407.14140  [pdf, other]

    eess.SP

    A Secure and Efficient Distributed Semantic Communication System for Heterogeneous Internet of Things

    Authors: Weihao Zeng, Xinyu Xu, Qianyun Zhang, Jiting Shi, Zhenyu Guan, Shufeng Li, Zhijin Qin

    Abstract: Semantic communications are expected to improve transmission efficiency in Internet of Things (IoT) networks. However, the distributed nature of networks and the heterogeneity of devices challenge the secure utilization of semantic communication systems. In this paper, we develop a distributed semantic communication system that achieves security and efficiency during the update and usage phases. A…

    Submitted 11 December, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

  49. arXiv:2407.05717  [pdf, ps, other]

    eess.SY cs.RO eess.SP

    A New Framework for Nonlinear Kalman Filters

    Authors: Shida Jiang, Junzhe Shi, Scott Moura

    Abstract: The Kalman filter (KF) is a state estimation algorithm that optimally combines system knowledge and measurements to minimize the mean squared error of the estimated states. While KF was initially designed for linear systems, numerous extensions of it, such as extended Kalman filter (EKF), unscented Kalman filter (UKF), cubature Kalman filter (CKF), etc., have been proposed for nonlinear systems ov…

    Submitted 19 June, 2025; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Massive revisions to the theoretical analysis part. Now Theorem 1 becomes much stronger and useful

  50. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name