-
Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain
Authors:
Junfei Shi,
Yu Cheng,
Haiyan Jin,
Junhuai Li,
Zhaolin Xiao,
Maoguo Gong,
Weisi Lin
Abstract:
Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited for PolSAR imagery. We propose a structural knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During the process, structural information from high-frequency coefficients is utilized to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.
Submitted 8 July, 2025;
originally announced July 2025.
-
CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning
Authors:
Jiacheng Shi,
Yanfu Zhang,
Ye Gao
Abstract:
Speech Emotion Recognition (SER) is fundamental to affective computing and human-computer interaction, yet existing models struggle to generalize across diverse acoustic conditions. While Contrastive Language-Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks dedicated mechanisms for capturing emotional cues, making it suboptimal for SER. To address this, we propose CLEP-DG, a framework that enhances CLAP's robustness in emotion recognition. First, we fine-tune CLAP to obtain CLEP, adapting it on large-scale emotional speech datasets to better encode emotion-relevant features. Then, we introduce Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that optimizes learnable prompt vectors to model diverse acoustic environments without additional labeled audio. Finally, leveraging cross-modal transferability, we train a classifier on text-derived embeddings and apply it to the audio encoder during inference, mitigating domain shifts between textual supervision and audio-based emotion recognition. Experiments across five benchmark datasets show that CLEP-DG outperforms prior CLAP-based approaches, achieving state-of-the-art performance in both supervised and domain generalization settings.
Submitted 5 July, 2025;
originally announced July 2025.
-
An Adaptive Estimation Approach based on Fisher Information to Overcome the Challenges of LFP Battery SOC Estimation
Authors:
Junzhe Shi,
Shida Jiang,
Shengyu Tao,
Jaewong Lee,
Manashita Borah,
Scott Moura
Abstract:
Robust and real-time State of Charge (SOC) estimation is essential for Lithium Iron Phosphate (LFP) batteries, which are widely used in electric vehicles (EVs) and energy storage systems due to safety and longevity. However, the flat Open Circuit Voltage (OCV)-SOC curve makes this task particularly challenging. This challenge is compounded by hysteresis effects and by real-world conditions such as current bias, voltage quantization errors, and temperature, which must be considered in battery management system use. In this paper, we propose an adaptive estimation approach to overcome the challenges of LFP SOC estimation. Specifically, the method uses an adaptive Fisher information fusion strategy that adaptively combines the SOC estimates from two different models, namely Coulomb counting and equivalent circuit model-based parameter identification. The effectiveness of this strategy is rationalized by the information richness excited by external cycling signals. A 3D OCV-H-SOC map that captures the relationship between OCV, hysteresis, and SOC is proposed as the backbone and can be generalized to other widely adopted parameter-identification methods. Extensive validation under ideal and real-world use scenarios, including SOC-OCV flat zones, current bias, voltage quantization errors, low temperatures, and insufficient current excitations, has been performed using 4 driving profiles, i.e., the Orange County Transit Bus Cycle, the California Unified Cycle, the US06 Drive Cycle, and the New York City Cycle, where the results demonstrate superiority over the state-of-the-art unscented Kalman filter, long short-term memory networks, and transformer in all validation cases.
Submitted 1 July, 2025;
originally announced July 2025.
-
OpusLM: A Family of Open Unified Speech Language Models
Authors:
Jinchuan Tian,
William Chen,
Yifan Peng,
Jiatong Shi,
Siddhant Arora,
Shikhar Bharadwaj,
Takashi Maekaku,
Yusuke Shinohara,
Keita Goto,
Xiang Yue,
Huck Yang,
Shinji Watanabe
Abstract:
This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate that our OpusLMs achieve performance comparable (or even superior) to existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
Submitted 21 June, 2025;
originally announced June 2025.
-
Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns
Authors:
Yiyang Tie,
Hong Zhu,
Yunyun Luo,
Jing Shi
Abstract:
The training of real-world super-resolution reconstruction models heavily relies on datasets that reflect real-world degradation patterns. Extracting and modeling degradation patterns for super-resolution reconstruction using only real-world low-resolution (LR) images remains a challenging task. When synthesizing datasets to simulate real-world degradation, relying solely on degradation extraction methods fails to capture both blur and diverse noise characteristics across varying LR distributions, as well as more implicit degradations such as color gamut shifts. Conversely, domain translation alone cannot accurately approximate real-world blur characteristics due to the significant degradation domain gap between synthetic and real data. To address these challenges, we propose a novel TripleGAN framework comprising two strategically designed components: The FirstGAN primarily focuses on narrowing the domain gap in blur characteristics, while the SecondGAN performs domain-specific translation to approximate target-domain blur properties and learn additional degradation patterns. The ThirdGAN is trained on pseudo-real data generated by the FirstGAN and SecondGAN to reconstruct real-world LR images. Extensive experiments on the RealSR and DRealSR datasets demonstrate that our method exhibits clear advantages in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts. The proposed framework effectively learns real-world degradation patterns from LR observations and synthesizes aligned datasets with corresponding degradation characteristics, thereby enabling the trained network to achieve superior performance in reconstructing high-quality SR images from real-world LR inputs.
Submitted 20 June, 2025;
originally announced June 2025.
-
Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment
Authors:
Wei Wang,
Wangyou Zhang,
Chenda Li,
Jiatong Shi,
Shinji Watanabe,
Yanmin Qian
Abstract:
Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted signal components. While SQA models are widely used to evaluate SE performance, their potential to guide SE training remains underexplored. In this work, we investigate a training framework that leverages a SQA model, trained to predict multiple evaluation metrics from a public SE leaderboard, as a supervisory signal for SE. This approach addresses a key limitation of conventional SE objectives, such as SI-SNR, which often fail to align with perceptual quality and generalize poorly across evaluation metrics. Moreover, it enables training on real-world data where clean references are unavailable. Experiments on both simulated and real-world test sets show that SQA-guided training consistently improves performance across a range of quality metrics.
Submitted 13 June, 2025;
originally announced June 2025.
-
Discrete Audio Tokens: More Than a Survey!
Authors:
Pooneh Mousavi,
Gallil Maimon,
Adel Moumen,
Darius Petermann,
Jiatong Shi,
Haibin Wu,
Haici Yang,
Anastasia Kuznetsova,
Artem Ploujnikov,
Ricard Marxer,
Bhuvana Ramabhadran,
Benjamin Elizalde,
Loren Lugosch,
Jinyu Li,
Cem Subakan,
Phil Woodland,
Minje Kim,
Hung-yi Lee,
Shinji Watanabe,
Yossi Adi,
Mirco Ravanelli
Abstract:
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
Submitted 16 June, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
Authors:
Siddhant Arora,
Jinchuan Tian,
Hayato Futami,
Jee-weon Jung,
Jiatong Shi,
Yosuke Kashiwagi,
Emiru Tsunoo,
Shinji Watanabe
Abstract:
Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as Switchboard. We will publicly release our models and training code.
Submitted 31 May, 2025;
originally announced June 2025.
-
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
Authors:
Jiatong Shi,
Yifan Cheng,
Bo-Hao Su,
Hye-jin Shim,
Jinchuan Tian,
Samuele Cornell,
Yiwen Zhao,
Siddhant Arora,
Shinji Watanabe
Abstract:
Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships.
Submitted 30 May, 2025;
originally announced May 2025.
-
Uni-VERSA: Versatile Speech Assessment with a Unified Network
Authors:
Jiatong Shi,
Hye-Jin Shim,
Shinji Watanabe
Abstract:
Subjective listening tests remain the gold standard for speech quality assessment, but are costly, variable, and difficult to scale. In contrast, existing objective metrics, such as PESQ, F0 correlation, and DNSMOS, typically capture only specific aspects of speech quality. To address these limitations, we introduce Uni-VERSA, a unified network that simultaneously predicts various objective metrics, encompassing naturalness, intelligibility, speaker characteristics, prosody, and noise, for a comprehensive evaluation of speech signals. We formalize its framework, evaluation protocol, and applications in speech enhancement, synthesis, and quality control. A benchmark based on the URGENT24 challenge, along with a baseline leveraging self-supervised representations, demonstrates that Uni-VERSA provides a viable alternative to single-aspect evaluation methods. Moreover, it aligns closely with human perception, making it a promising approach for future speech quality assessment.
Submitted 27 May, 2025;
originally announced May 2025.
-
CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning
Authors:
Renyuan Li,
Zhibo Liang,
Haichuan Zhang,
Tianyu Shi,
Zhiyuan Cheng,
Jia Shi,
Carl Yang,
Mingjie Tang
Abstract:
Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns, allowing highly accurate vocal identity replication from just a few seconds of reference audio, while retaining the speaker's vocal authenticity. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text. We formulate perturbation generation as a multi-objective optimization problem, and propose Multi-Gradient Descent Algorithm (MGDA) to ensure the robust protection across diverse utterances. To preserve natural auditory perception for users, we decompose the adversarial perturbation via Mel-spectrogram representations and fine-tune it for each sample. This design ensures imperceptibility while maintaining strong degradation effects on zero-shot cloned outputs. Experiments on three state-of-the-art zero-shot TTS systems, five benchmark datasets and evaluations from 60 human listeners demonstrate that our method preserves near-original audio quality in protected inputs (PESQ = 3.90, SRS = 0.93) while substantially degrading both speaker similarity and speech quality in cloned samples (PESQ = 1.07, SRS = 0.08).
Submitted 25 May, 2025;
originally announced May 2025.
-
MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling
Authors:
Yifan Cheng,
Ruoyi Zhang,
Jiatong Shi
Abstract:
Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (0.93 Fleiss kappa score) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate fine-grained speech emotion categories of up to 26 types, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further released a fine-grained emotional speech dataset, MIKU-EmoBench (131.2 hours), as a new benchmark for emotional text-to-speech and visual voice cloning.
Submitted 21 May, 2025;
originally announced May 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components used in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.
Submitted 17 April, 2025;
originally announced April 2025.
-
Model Predictive Building Climate Control for Mitigating Heat Pump Noise Pollution (Extended Version)
Authors:
Yun Li,
Jicheng Shi,
Colin N. Jones,
Neil Yorke-Smith,
Tamas Keviczky
Abstract:
Noise pollution from heat pumps (HPs) has been an emerging concern to their broader adoption, especially in densely populated areas. This paper explores a model predictive control (MPC) approach for building climate control, aimed at minimizing the noise nuisance generated by HPs. By exploiting a piecewise linear approximation of HP noise patterns and assuming linear building thermal dynamics, the proposed design can be generalized to handle various HP acoustic patterns with mixed-integer linear programming (MILP). Additionally, two computationally efficient options for defining the noise cost function in the proposed MPC design are discussed. Numerical experiments on a high-fidelity building simulator are performed to demonstrate the viability and effectiveness of the proposed design. Simulation results show that the proposed approach can effectively reduce the noise pollution caused by HPs with negligible energy cost increase.
Submitted 5 April, 2025;
originally announced April 2025.
-
OccludeNeRF: Geometric-aware 3D Scene Inpainting with Collaborative Score Distillation in NeRF
Authors:
Jingyu Shi,
Achleshwar Luthra,
Jiazhi Li,
Xiang Gao,
Xiyun Song,
Zongfang Lin,
David Gu,
Heather Yu
Abstract:
With Neural Radiance Fields (NeRFs) arising as a powerful 3D representation, research has investigated their various downstream tasks, including inpainting NeRFs with 2D images. Despite successful efforts addressing view consistency and geometry quality, prior methods still suffer from occlusion in NeRF inpainting tasks, where the 2D prior is severely limited in forming a faithful reconstruction of the scene to inpaint.
To address this, we propose a novel approach that enables cross-view information sharing during knowledge distillation from a diffusion model, effectively propagating occluded information across limited views. Additionally, to align the distillation direction across multiple sampled views, we apply a grid-based denoising strategy and incorporate additional rendered views to enhance cross-view consistency. To assess our approach's capability of handling occlusion cases, we construct a dataset consisting of challenging scenes with severe occlusion, in addition to existing datasets. Compared with baseline methods, our method demonstrates better performance in cross-view consistency and faithfulness in reconstruction, while preserving high rendering quality and fidelity.
Submitted 1 April, 2025;
originally announced April 2025.
-
Disturbance-adaptive Model Predictive Control for Bounded Average Constraint Violations
Authors:
Jicheng Shi,
Colin N. Jones
Abstract:
This paper considers stochastic linear time-invariant systems subject to constraints on the average number of state-constraint violations over time without knowing the disturbance distribution. We present a novel disturbance-adaptive model predictive control (DAD-MPC) framework, which adjusts the disturbance model based on measured constraint violations. Using a robust invariance method, DAD-MPC ensures recursive feasibility and guarantees asymptotic or robust bounds on average constraint violations. Additionally, the bounds hold even with an inaccurate disturbance model, which allows for data-driven disturbance quantification methods to be used, such as conformal prediction. Simulation results demonstrate that the proposed approach outperforms state-of-the-art methods while satisfying average violation constraints.
Submitted 7 May, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
Joint Sparse Graph for Enhanced MIMO-AFDM Receiver Design
Authors:
Qu Luo,
Jing Zhu,
Zilong Liu,
Yanqun Tang,
Pei Xiao,
Gaojie Chen,
Jia Shi
Abstract:
Affine frequency division multiplexing (AFDM) is a promising chirp-assisted multicarrier waveform for future high-mobility communications. This paper is devoted to enhanced receiver design for multiple-input multiple-output AFDM (MIMO-AFDM) systems. Firstly, we introduce a unified variational inference (VI) approach to approximate the target posterior distribution, under which the belief propagation (BP) and expectation propagation (EP)-based algorithms are derived. As both VI-based detection and low-density parity-check (LDPC) decoding can be expressed by bipartite graphs in MIMO-AFDM systems, we construct a joint sparse graph (JSG) by merging the graphs of these two for low-complexity receiver design. Then, based on this graph model, we present the detailed message propagation of the proposed JSG. Additionally, we propose an enhanced JSG (E-JSG) receiver based on the linear constellation encoding model. The proposed E-JSG eliminates the need for interleavers, de-interleavers, and log-likelihood ratio transformations, thus leading to concurrent detection and decoding over the integrated sparse graph. To further reduce detection complexity, we introduce a sparse channel method by approximating multiple graph edges with insignificant channel coefficients into a single edge on the VI graph. Simulation results show the superiority of the proposed receivers in terms of computational complexity, detection and decoding latency, and error rate performance compared to the conventional ones.
Submitted 24 March, 2025;
originally announced March 2025.
-
Aligning Text-to-Music Evaluation with Human Preferences
Authors:
Yichen Huang,
Zachary Novack,
Koichi Saito,
Jiatong Shi,
Shinji Watanabe,
Yuki Mitsufuji,
John Thickstun,
Chris Donahue
Abstract:
Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
Submitted 20 March, 2025;
originally announced March 2025.
-
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Authors:
Siddhant Arora,
Yifan Peng,
Jiatong Shi,
Jinchuan Tian,
William Chen,
Shikhar Bharadwaj,
Hayato Futami,
Yosuke Kashiwagi,
Emiru Tsunoo,
Shuichiro Shimizu,
Vaibhav Srivastav,
Shinji Watanabe
Abstract:
Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: https://huggingface.co/spaces/Siddhant/Voice_Assistant_Demo.
Submitted 11 March, 2025;
originally announced March 2025.
-
ImplicitCell: Resolution Cell Modeling of Joint Implicit Volume Reconstruction and Pose Refinement in Freehand 3D Ultrasound
Authors:
Sheng Song,
Yiting Chen,
Duo Xu,
Songhan Ge,
Yunqian Huang,
Junni Shi,
Man Chen,
Hongbo Chen,
Rui Zheng
Abstract:
Freehand 3D ultrasound enables volumetric imaging by tracking a conventional ultrasound probe during freehand scanning, offering enriched spatial information that improves clinical diagnosis. However, the quality of reconstructed volumes is often compromised by tracking system noise and irregular probe movements, leading to artifacts in the final reconstruction. To address these challenges, we propose ImplicitCell, a novel framework that integrates Implicit Neural Representation (INR) with an ultrasound resolution cell model for joint optimization of volume reconstruction and pose refinement. Three distinct datasets are used for comprehensive validation, including phantom, common carotid artery, and carotid atherosclerosis. Experimental results demonstrate that ImplicitCell significantly reduces reconstruction artifacts and improves volume quality compared to existing methods, particularly in challenging scenarios with noisy tracking data. These improvements enhance the clinical utility of freehand 3D ultrasound by providing more reliable and precise diagnostic information.
Submitted 9 March, 2025;
originally announced March 2025.
-
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
Authors:
Jiatong Shi,
Chunlei Zhang,
Jinchuan Tian,
Junrui Ni,
Hao Zhang,
Shinji Watanabe,
Dong Yu
Abstract:
Recent efforts have extended textual LLMs to the speech domain. Yet, a key challenge remains, which is balancing speech understanding and generation while avoiding catastrophic forgetting when integrating acoustically rich codec-based representations into models originally trained on text. In this work, we propose a novel approach that leverages continual pre-training (CPT) on a pre-trained textual LLM to create a codec-based speech language model. This strategy mitigates the modality gap between text and speech, preserving the linguistic reasoning of the original model while enabling high-fidelity speech synthesis. We validate our approach with extensive experiments across multiple tasks, including automatic speech recognition, text-to-speech, speech-to-text translation, and speech-to-speech translation (S2ST), demonstrating that our model achieves superior TTS performance and, notably, the first end-to-end S2ST system based on neural codecs.
Submitted 24 February, 2025;
originally announced February 2025.
-
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Authors:
Jinchuan Tian,
Jiatong Shi,
William Chen,
Siddhant Arora,
Yoshiki Masuyama,
Takashi Maekaku,
Yihan Wu,
Junyi Peng,
Shikhar Bharadwaj,
Yiwen Zhao,
Samuele Cornell,
Yifan Peng,
Xiang Yue,
Chao-Han Huck Yang,
Graham Neubig,
Shinji Watanabe
Abstract:
We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.
Submitted 24 February, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
On Quantizing Neural Representation for Variable-Rate Video Coding
Authors:
Junqi Shi,
Zhujia Chen,
Hanfei Li,
Qi Zhao,
Ming Lu,
Tong Chen,
Zhan Ma
Abstract:
This work introduces NeuroQuant, a novel post-training quantization (PTQ) approach tailored to non-generalized Implicit Neural Representations for variable-rate Video Coding (INR-VC). Unlike existing methods that require extensive weight retraining for each target bitrate, we hypothesize that variable-rate coding can be achieved by adjusting quantization parameters (QPs) of pre-trained weights. Our study reveals that traditional quantization methods, which assume inter-layer independence, are ineffective for non-generalized INR-VC models due to significant dependencies across layers. To address this, we redefine variable-rate INR-VC as a mixed-precision quantization problem and establish a theoretical framework for sensitivity criteria aimed at simplified, fine-grained rate control. Additionally, we propose network-wise calibration and channel-wise quantization strategies to minimize quantization-induced errors, arriving at a unified formula for representation-oriented PTQ calibration. Our experimental evaluations demonstrate that NeuroQuant significantly outperforms existing techniques in varying bitwidth quantization and compression efficiency, accelerating encoding by up to eight times and enabling quantization down to INT2 with minimal reconstruction loss. This work introduces variable-rate INR-VC for the first time and lays a theoretical foundation for future research in rate-distortion optimization, advancing the field of video coding technology. The materials will be available at https://github.com/Eric-qi/NeuroQuant.
Submitted 17 February, 2025;
originally announced February 2025.
-
Unsupervised Self-Prior Embedding Neural Representation for Iterative Sparse-View CT Reconstruction
Authors:
Xuanyu Tian,
Lixuan Chen,
Qing Wu,
Chenhe Du,
Jingjing Shi,
Hongjiang Wei,
Yuyao Zhang
Abstract:
Emerging unsupervised implicit neural representation (INR) methods, such as NeRP, NeAT, and SCOPE, have shown great potential to address sparse-view computed tomography (SVCT) inverse problems. Although these INR-based methods perform well in relatively dense SVCT reconstructions, they struggle to achieve comparable performance to supervised methods in sparser SVCT scenarios. They are prone to being affected by noise, limiting their applicability in real clinical settings. Additionally, current methods have not fully explored the use of image domain priors for solving SVCT inverse problems. In this work, we demonstrate that imperfect reconstruction results can provide effective image domain priors for INRs to enhance performance. To leverage this, we introduce Self-prior embedding neural representation (Spener), a novel unsupervised method for SVCT reconstruction that integrates iterative reconstruction algorithms. During each iteration, Spener extracts local image prior features from the previous iteration and embeds them to constrain the solution space. Experimental results on multiple CT datasets show that our unsupervised Spener method achieves performance comparable to supervised state-of-the-art (SOTA) methods on in-domain data while outperforming them on out-of-domain datasets. Moreover, Spener significantly improves the performance of INR-based methods in handling SVCT with noisy sinograms. Our code is available at https://github.com/MeijiTian/Spener.
Submitted 7 February, 2025;
originally announced February 2025.
-
Which price to pay? Auto-tuning building MPC controller for optimal economic cost
Authors:
Jiarui Yu,
Jicheng Shi,
Wenjie Xu,
Colin N. Jones
Abstract:
A model predictive control (MPC) controller is considered for temperature management in buildings, but its performance heavily depends on hyperparameters. Consequently, MPC necessitates meticulous hyperparameter tuning to attain optimal performance under diverse contracts. However, conventional building controller design is an open-loop process without critical hyperparameter optimization, often leading to suboptimal performance due to unexpected environmental disturbances and modeling errors. Furthermore, these hyperparameters are not adapted to different pricing schemes and may lead to non-economic operations. To address these issues, we propose an efficient performance-oriented building MPC controller tuning method based on a cutting-edge efficient constrained Bayesian optimization algorithm, CONFIG, with global optimality guarantees. We demonstrate that this technique can be applied to efficiently deal with real-world DSM program selection problems under customized black-box constraints and objectives. In this study, a simple MPC controller, which offers the advantages of reduced commissioning costs and enhanced computational efficiency, was optimized to perform on a comparable level to a delicately designed and computationally expensive MPC controller. The results also indicate that with an optimized simple MPC, the monthly electricity cost of a household can be reduced by up to 26.90% compared with the cost when controlled by a basic rule-based controller under the same constraints. We then compared 12 real electricity contracts in Belgium for a household with customized black-box occupant comfort constraints. The results indicate a monthly electricity bill saving of up to 20.18% when the most economic contract is compared with the worst one, which again illustrates the significance of choosing a proper electricity contract.
Submitted 18 January, 2025;
originally announced January 2025.
-
Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction
Authors:
Hao Zhang,
Qi Wang,
Jian Sun,
Zhijie Wen,
Jun Shi,
Shihui Ying
Abstract:
Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but suffers from prolonged acquisition time. Although deep learning methods have been proposed to accelerate acquisition and demonstrate promising performance, they rely on high-quality fully-sampled datasets for training in a supervised manner. However, such datasets are time-consuming and expensive to collect, which constrains their broader applications. On the other hand, self-supervised methods offer an alternative by enabling learning from under-sampled data alone, but most existing methods rely on further partitioned under-sampled k-space data as the model's input for training, resulting in a loss of valuable information. Additionally, their models have not fully incorporated image priors, leading to degraded reconstruction performance. In this paper, we propose a novel re-visible dual-domain self-supervised deep unfolding network to address these issues when only under-sampled datasets are available. Specifically, by incorporating a re-visible dual-domain loss, all under-sampled k-space data are utilized during training to mitigate information loss caused by further partitioning. This design enables the model to implicitly adapt to all under-sampled k-space data as input. Additionally, we design a deep unfolding network based on the Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) to achieve end-to-end reconstruction, incorporating imaging physics and image priors to guide the reconstruction process. By employing a Spatial-Frequency Feature Extraction (SFFE) block to capture global and local feature representations, we enhance the model's efficiency in learning comprehensive image priors. Experiments conducted on the fastMRI and IXI datasets demonstrate that our method significantly outperforms state-of-the-art approaches in terms of reconstruction performance.
Submitted 7 January, 2025;
originally announced January 2025.
-
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Authors:
Jiatong Shi,
Hye-jin Shim,
Jinchuan Tian,
Siddhant Arora,
Haibin Wu,
Darius Petermann,
Jia Qi Yip,
You Zhang,
Yuxun Tang,
Wangyou Zhang,
Dareen Safar Alharthi,
Yichen Huang,
Koichi Saito,
Jionghao Han,
Yiwen Zhao,
Chris Donahue,
Shinji Watanabe
Abstract:
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile enough to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/wavlab-speech/versa.
Submitted 26 March, 2025; v1 submitted 23 December, 2024;
originally announced December 2024.
-
Technical Report: Towards Spatial Feature Regularization in Deep-Learning-Based Array-SAR Reconstruction
Authors:
Yu Ren,
Xu Zhan,
Yunqiao Hu,
Xiangdong Ma,
Liang Liu,
Mou Wang,
Jun Shi,
Shunjun Wei,
Tianjiao Zeng,
Xiaoling Zhang
Abstract:
Array synthetic aperture radar (Array-SAR), also known as tomographic SAR (TomoSAR), has demonstrated significant potential for high-quality 3D mapping, particularly in urban areas. While deep learning (DL) methods have recently shown strengths in reconstruction, most studies rely on pixel-by-pixel reconstruction, neglecting spatial features like building structures, leading to artifacts such as holes and fragmented edges. Spatial feature regularization, effective in traditional methods, remains underexplored in DL-based approaches. Our study integrates spatial feature regularization into DL-based Array-SAR reconstruction, addressing key questions: What spatial features are relevant in urban-area mapping? How can these features be effectively described, modeled, regularized, and incorporated into DL networks? The study comprises five phases: spatial feature description and modeling, regularization, feature-enhanced network design, evaluation, and discussions. Sharp edges and geometric shapes in urban scenes are analyzed as key features. An intra-slice and inter-slice strategy is proposed, using 2D slices as reconstruction units and fusing them into 3D scenes through parallel and serial fusion. Two computational frameworks (iterative reconstruction with enhancement and light reconstruction with enhancement) are designed, incorporating spatial feature modules into DL networks, leading to four specialized reconstruction networks. Using our urban building simulation dataset and two public datasets, six tests evaluate close-point resolution, structural integrity, and robustness in urban scenarios. Results show that spatial feature regularization significantly improves reconstruction accuracy, retrieves more complete building structures, and enhances robustness by reducing noise and outliers.
Submitted 21 December, 2024;
originally announced December 2024.
-
A Robust Anti-noise Scheme for RF Fingerprint Identification
Authors:
Junxian Shi,
Linning Peng,
Wentao Jing,
Lingnan Xie,
Haichuan Peng,
Aiqun Hu
Abstract:
Radio frequency (RF) fingerprint technology is utilized for wireless device identification and is extensively employed in the internet of things (IoT). The operating environment for IoT devices is challenging, with pervasive noise and distortion on the signals which blur the feature space of RF fingerprints. Consequently, the accuracy of a model trained in high signal-to-noise ratio (SNR) scenarios decreases when the received signals in testing have low SNR. To solve the noise domain adaptation problem, an anti-noise scheme is proposed to enhance identification accuracy of RF fingerprint at varying SNRs. The squared cross power spectral density (SCPSD) features are first proposed to obtain the same RF fingerprint representation. Subsequently, the specific effect of noise on SCPSD is theoretically derived and the rationality of the scheme is demonstrated through simulation experiments. Finally, 60 off-the-shelf ZigBee devices are employed to evaluate the performance of the anti-noise algorithm. The experimental results show that the random subspace k-nearest neighbors (RSKNN) classifier not only effectively classifies devices with multi-cluster features but, combined with the anti-noise scheme, also significantly improves the accuracy by approximately 46% for SNRs not less than 5 dB.
Submitted 18 December, 2024;
originally announced December 2024.
-
Seamless Optical Cloud Computing across Edge-Metro Network for Generative AI
Authors:
Sizhe Xing,
Aolong Sun,
Chengxi Wang,
Yizhi Wang,
Boyu Dong,
Junhui Hu,
Xuyu Deng,
An Yan,
Yingjun Liu,
Fangchen Hu,
Zhongya Li,
Ouhan Huang,
Junhao Zhao,
Yingjun Zhou,
Ziwei Li,
Jianyang Shi,
Xi Xiao,
Richard Penty,
Qixiang Cheng,
Nan Chi,
Junwen Zhang
Abstract:
The rapid advancement of generative artificial intelligence (AI) in recent years has profoundly reshaped modern lifestyles, necessitating a revolutionary architecture to support the growing demands for computational power. Cloud computing has become the driving force behind this transformation. However, it consumes significant power and faces computation security risks due to the reliance on extensive data centers and servers in the cloud. Reducing power consumption while enhancing computational scale remain persistent challenges in cloud computing. Here, we propose and experimentally demonstrate an optical cloud computing system that can be seamlessly deployed across an edge-metro network. By modulating inputs and models into light, a wide range of edge nodes can directly access the optical computing center via the edge-metro network. The experimental validations show an energy efficiency of 118.6 mW/TOPs (tera operations per second), reducing energy consumption by two orders of magnitude compared to traditional electronic-based cloud computing solutions. Furthermore, it is experimentally validated that this architecture can perform various complex generative AI models through parallel computing to achieve image generation tasks.
Submitted 1 May, 2025; v1 submitted 4 December, 2024;
originally announced December 2024.
-
Disturbance-Adaptive Data-Driven Predictive Control: Trading Comfort Violations for Savings in Building Climate Control
Authors:
Jicheng Shi,
Christophe Salzmann,
Colin N. Jones
Abstract:
Model Predictive Control (MPC) has demonstrated significant potential in improving energy efficiency in building climate control, outperforming traditional controllers commonly used in modern building management systems. Among MPC variants, Data-driven Predictive Control (DPC) offers the advantage of modeling building dynamics directly from data, thereby substantially reducing commissioning efforts. However, inevitable model uncertainties and measurement noise can result in comfort violations, even with dedicated MPC setups. This paper introduces a Disturbance-Adaptive DPC (DAD-DPC) framework that ensures asymptotic satisfaction of predefined violation bounds without knowing the uncertainty and noise distributions. The framework employs a data-driven pipeline based on Willems' Fundamental Lemma and conformal prediction for application in building climate control. The proposed DAD-DPC framework was validated through four building cases using the high-fidelity BOPTEST simulation platform and an occupied campus building, Polydome. DAD-DPC successfully regulated the average comfort violations to meet pre-defined bounds. Notably, the 5%-violation DAD-DPC setup achieved 30.1%/11.2%/27.1%/20.5% energy savings compared to default controllers across four cases. These results demonstrate the framework's effectiveness in balancing energy consumption and comfort violations, offering a practical solution for building climate control applications.
Submitted 1 July, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario
Authors:
Shih-Heng Wang,
Zih-Ching Chen,
Jiatong Shi,
Ming-To Chuang,
Guan-Ting Lin,
Kuan-Po Huang,
David Harwath,
Shang-Wen Li,
Hung-yi Lee
Abstract:
The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation to warm up the adapter and downstream model initialization. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.
Submitted 5 January, 2025; v1 submitted 27 November, 2024;
originally announced November 2024.
-
Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition
Authors:
Shih-heng Wang,
Jiatong Shi,
Chien-yu Huang,
Shinji Watanabe,
Hung-yi Lee
Abstract:
Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage demands. On the other hand, discrete SSL representations, although with degraded performance, reduce transmission and storage costs, and improve input sequence efficiency through de-duplication and subword-modeling. To boost the performance of discrete representations for ASR, we introduce a novel fusion mechanism that integrates two discrete representations. The fusion mechanism preserves all the benefits of discrete representation while enhancing the model's performance by integrating complementary information. Additionally, we explore "self-augmented" discrete representations, which apply transformations to a single continuous SSL representation, eliminating the fusion mechanism's dependency on multiple SSL models and further decreasing its inference costs. Experimental results on benchmarks, including LibriSpeech and ML-SUPERB, indicate up to 19% and 24% relative character error rate improvement compared with the non-fusion baseline, validating the effectiveness of our proposed methods.
Submitted 27 November, 2024;
originally announced November 2024.
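The abstract above mentions that discrete SSL representations shorten input sequences through de-duplication. A minimal sketch of that idea follows, assuming de-duplication means collapsing runs of identical consecutive unit IDs; the function name `deduplicate` and the sample unit IDs are hypothetical and are not taken from the paper.

```python
from itertools import groupby


def deduplicate(units):
    """Collapse each run of identical consecutive unit IDs into a single ID.

    Assumption: `units` is a frame-level sequence of discrete unit IDs,
    e.g. cluster indices assigned to SSL features.
    """
    return [key for key, _run in groupby(units)]


# Hypothetical frame-level unit sequence (8 frames).
frame_units = [12, 12, 12, 7, 7, 31, 12, 12]
print(deduplicate(frame_units))  # [12, 7, 31, 12]
```

Only adjacent repeats are merged, so a unit that reappears later in the sequence is kept; the order of the remaining units is unchanged.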
-
EEG-DCNet: A Fast and Accurate MI-EEG Dilated CNN Classification Method
Authors:
Wei Peng,
Kang Liu,
Jiaxi Shi,
Jianchen Hu
Abstract:
The electroencephalography (EEG)-based motor imagery (MI) classification is a critical and challenging task in brain-computer interface (BCI) technology, which plays a significant role in assisting patients with functional impairments to regain mobility. We present a novel multi-scale atrous convolutional neural network (CNN) model called EEG-dilated convolution network (DCNet) to enhance the accuracy and efficiency of EEG-based MI classification tasks. We incorporate a $1\times1$ convolutional layer and use a multi-branch parallel atrous convolutional architecture in EEG-DCNet to capture the highly nonlinear characteristics and multi-scale features of EEG signals. Moreover, we use a sliding window to enhance temporal consistency and an attention mechanism to improve the accuracy of recognizing user intentions. The experimental results on the BCI-IV-2a, BCI-IV-2b, and High-Gamma datasets show that EEG-DCNet outperforms existing state-of-the-art (SOTA) approaches in terms of classification accuracy and Kappa scores. Furthermore, since EEG-DCNet requires fewer parameters, training efficiency and memory consumption are also improved. The experiment code is open-sourced at https://github.com/Kanyooo/EEG-DCNet.
Submitted 12 November, 2024;
originally announced November 2024.
-
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Authors:
Chien-yu Huang,
Wei-Chih Chen,
Shu-wen Yang,
Andy T. Liu,
Chen-An Li,
Yu-Xiang Lin,
Wei-Cheng Tseng,
Anuj Diwan,
Yi-Jen Shih,
Jiatong Shi,
William Chen,
Chih-Kai Yang,
Wenze Ren,
Xuanjun Chen,
Chi-Yuan Hsiao,
Puyuan Peng,
Shih-Heng Wang,
Chun-Yi Kuan,
Ke-Han Lu,
Kai-Wei Chang,
Fabian Ritter-Gutierrez,
Kuan-Po Huang,
Siddhant Arora,
You-Kuan Lin,
Ming To Chuang
, et al. (55 additional authors not shown)
Abstract:
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.
Submitted 9 June, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
Exploiting Longitudinal Speech Sessions via Voice Assistant Systems for Early Detection of Cognitive Decline
Authors:
Kristin Qi,
Jiatong Shi,
Caroline Summerour,
John A. Batsis,
Xiaohui Liang
Abstract:
Mild Cognitive Impairment (MCI) is an early stage of Alzheimer's disease (AD), a form of neurodegenerative disorder. Early identification of MCI is crucial for delaying its progression through timely interventions. Existing research has demonstrated the feasibility of detecting MCI using speech collected from clinical interviews or digital devices. However, these approaches typically analyze data collected at limited time points, limiting their ability to identify cognitive changes over time. This paper presents a longitudinal study using voice assistant systems (VAS) to remotely collect seven-session speech data at three-month intervals across 18 months. We propose two methods to improve MCI detection and the prediction of cognitive changes. The first method incorporates historical data, while the second predicts cognitive changes at two time points. Our results indicate improvements when incorporating historical data: the average F1-score for MCI detection improves from 58.6% to 71.2% (by 12.6 percentage points) in the case of acoustic features and from 62.1% to 75.1% (by 13.0 percentage points) in the case of linguistic features. Additionally, the prediction of cognitive changes achieves an F1-score of 73.7% in the case of acoustic features. These results confirm the potential of VAS-based speech sessions for early detection of cognitive decline.
Submitted 15 October, 2024;
originally announced October 2024.
-
HASN: Hybrid Attention Separable Network for Efficient Image Super-resolution
Authors:
Weifeng Cao,
Xiaoyan Lei,
Jun Shi,
Wanyong Liang,
Jie Liu,
Zongfei Bai
Abstract:
Recently, lightweight methods for single image super-resolution (SISR) have gained significant popularity and achieved impressive performance due to limited hardware resources. These methods demonstrate that adopting residual feature distillation is an effective way to enhance performance. However, we find that using residual connections after each block increases the model's storage and computational cost. Therefore, to simplify the network structure and learn higher-level features and relationships between features, we use depthwise separable convolutions, fully connected layers, and activation functions as the basic feature extraction modules. This significantly reduces computational load and the number of parameters while maintaining strong feature extraction capabilities. To further enhance model performance, we propose the Hybrid Attention Separable Block (HASB), which combines channel attention and spatial attention, thus making use of their complementary advantages. Additionally, we use depthwise separable convolutions instead of standard convolutions, significantly reducing the computational load and the number of parameters while maintaining strong feature extraction capabilities. During the training phase, we also adopt a warm-start retraining strategy to further exploit the potential of the model. Extensive experiments demonstrate the effectiveness of our approach. Our method achieves a smaller model size and reduced computational complexity without compromising performance. Code is available at https://github.com/nathan66666/HASN.git
Submitted 13 October, 2024;
originally announced October 2024.
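The HASN abstract above attributes its parameter savings to replacing standard convolutions with depthwise separable ones. The arithmetic behind that claim can be sketched as follows, assuming square kernels and no bias terms; the function names and the 3x3, 64-channel example are illustrative assumptions, not values from the paper.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard 2D convolution (bias omitted)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels.
standard = standard_conv_params(3, 64, 64)
separable = depthwise_separable_params(3, 64, 64)
print(standard, separable)  # 36864 4672
```

For this hypothetical layer the separable form uses roughly one eighth of the weights; the saving grows with kernel size and output channel count.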
-
Edge-guided inverse design of digital metamaterial-based mode multiplexers for high-capacity multi-dimensional interconnect
Authors:
Aolong Sun,
Sizhe Xing,
Xuyu Deng,
Ruoyu Shen,
An Yan,
Fangchen Hu,
Yuqin Yuan,
Boyu Dong,
Junhao Zhao,
Ouhan Huang,
Ziwei Li,
Jianyang Shi,
Yingjun Zhou,
Chao Shen,
Yiheng Zhao,
Bingzhou Hong,
Wei Chu,
Junwen Zhang,
Haiwen Cai,
Nan Chi
Abstract:
The escalating demands of compute-intensive applications urgently necessitate the adoption of optical interconnect technologies to overcome bottlenecks in scaling computing systems. This requires fully exploiting the inherent parallelism of light across scalable dimensions for data loading. Here we experimentally demonstrate a synergy of wavelength- and mode-multiplexing combined with high-order modulation formats to achieve multi-tens-of-terabits-per-second optical interconnects using foundry-compatible silicon photonic circuits. Implementing an edge-guided analog-and-digital optimization method that integrates high efficiency with fabrication robustness, we achieve the inverse design of mode multiplexers based on digital metamaterial waveguides. Furthermore, we employ a packaged five-mode multiplexing chip, achieving a single-wavelength interconnect capacity of 1.62 Tbit/s and a record-setting multi-dimensional interconnect capacity of 38.2 Tbit/s across 5 modes and 88 wavelength channels, with high-order formats up to 8-ary pulse-amplitude-modulation (PAM). This study highlights the transformative potential of optical interconnect technologies to surmount the constraints of electronic links, thus setting the stage for next-generation datacenter and optical compute interconnects.
Submitted 26 February, 2025; v1 submitted 9 October, 2024;
originally announced October 2024.
-
ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech
Authors:
Jiatong Shi,
Jinchuan Tian,
Yihan Wu,
Jee-weon Jung,
Jia Qi Yip,
Yoshiki Masuyama,
William Chen,
Yuning Wu,
Yuxun Tang,
Massa Baali,
Dareen Alharhi,
Dong Zhang,
Ruifan Deng,
Tejes Srivastava,
Haibin Wu,
Alexander H. Liu,
Bhiksha Raj,
Qin Jin,
Ruihua Song,
Shinji Watanabe
Abstract:
Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications.
Submitted 24 February, 2025; v1 submitted 24 September, 2024;
originally announced September 2024.
-
ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
Authors:
Masao Someki,
Kwanghee Choi,
Siddhant Arora,
William Chen,
Samuele Cornell,
Jionghao Han,
Yifan Peng,
Jiatong Shi,
Vaibhav Srivastav,
Shinji Watanabe
Abstract:
We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the amount of newly written code by 2.7x and the amount of dependent code by 6.7x while dramatically reducing the Bash script dependencies. The codebase of ESPnet-EZ is publicly available.
Submitted 14 September, 2024;
originally announced September 2024.
-
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm
Authors:
Yuning Wu,
Jiatong Shi,
Yifeng Yu,
Yuxun Tang,
Tao Qian,
Yueqian Lin,
Jionghao Han,
Xinyi Bai,
Shinji Watanabe,
Qin Jin
Abstract:
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module to imitate human subjective evaluating scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.
Submitted 10 October, 2024; v1 submitted 11 September, 2024;
originally announced September 2024.
-
Difference Between Cyclic and Distributed Approach in Stochastic Optimization for Multi-agent System
Authors:
Jiahao Shi,
James C. Spall
Abstract:
Many stochastic optimization problems in multi-agent systems can be decomposed into smaller subproblems or reduced decision subspaces. The cyclic and distributed approaches are two widely used strategies for solving such problems. In this manuscript, we review four existing methods for addressing these problems and compare them based on their suitable problem frameworks and update rules.
Submitted 8 September, 2024;
originally announced September 2024.
-
SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge
Authors:
You Zhang,
Yongyi Zang,
Jiatong Shi,
Ryuichi Yamamoto,
Tomoki Toda,
Zhiyao Duan
Abstract:
With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD track utilizes publicly available singing vocal data to generate deepfakes using state-of-the-art singing voice synthesis and conversion systems. Meanwhile, the WildSVDD track expands upon the existing SingFake dataset, which includes data sourced from popular user-generated content websites. For the CtrSVDD track, we received submissions from 47 teams, with 37 surpassing our baselines and the top team achieving a 1.65% equal error rate. For the WildSVDD track, we benchmarked the baselines. This paper reviews these results, discusses key findings, and outlines future directions for SVDD research.
Submitted 23 September, 2024; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Self-supervised Speech Representations Still Struggle with African American Vernacular English
Authors:
Kalvin Chang,
Yi-Hui Chou,
Jiatong Shi,
Hsuan-Ming Chen,
Nicole Holliday,
Odette Scharenborg,
David R. Mortensen
Abstract:
Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. Additionally, the models have higher word error rates on utterances with more phonological and morphosyntactic features of AAVE. Despite the success of SSL speech models in improving ASR for low resource varieties, SSL pre-training alone may not bridge the gap between AAVE and MAE. Our code is publicly available at https://github.com/cmu-llab/s3m-aave.
Submitted 26 August, 2024;
originally announced August 2024.
-
Topological GCN for Improving Detection of Hip Landmarks from B-Mode Ultrasound Images
Authors:
Tianxiang Huang,
Jing Shi,
Ge Jin,
Juncheng Li,
Jun Wang,
Jun Du,
Jun Shi
Abstract:
B-mode ultrasound-based computer-aided diagnosis (CAD) has demonstrated its effectiveness for diagnosis of Developmental Dysplasia of the Hip (DDH) in infants. However, due to the effect of speckle noise in ultrasound images, it is still a challenging task to accurately detect hip landmarks. In this work, we propose a novel hip landmark detection model by integrating the Topological GCN (TGCN) with an Improved Conformer (TGCN-ICF) into a unified framework to improve detection performance. The TGCN-ICF includes two subnetworks: an Improved Conformer (ICF) subnetwork to generate heatmaps and a TGCN subnetwork to additionally refine landmark detection. This TGCN can effectively improve detection accuracy with the guidance of class labels. Moreover, a Mutual Modulation Fusion (MMF) module is developed for deeply exchanging and fusing the features extracted from the U-Net and Transformer branches in ICF. The experimental results on the real DDH dataset demonstrate that the proposed TGCN-ICF outperforms all the compared algorithms.
Submitted 24 August, 2024;
originally announced August 2024.
-
Relax, Estimate, and Track: a Simple Battery State-of-charge and State-of-health Estimation Method
Authors:
Shida Jiang,
Junzhe Shi,
Scott Moura
Abstract:
Battery management is a critical component of ubiquitous battery-powered energy systems, in which battery state-of-charge (SOC) and state-of-health (SOH) estimations are of crucial importance. Conventional SOC and SOH estimation methods, especially model-based methods, often lack accurate modeling of the open circuit voltage (OCV), have relatively high computational complexity, and lack theoretical analysis. This study introduces a simple SOC and SOH estimation method that overcomes all these weaknesses. The key idea of the proposed method is to momentarily set the cell's current to zero for a few minutes during the charging, perform SOC and SOH estimation based on the measured data, and continue tracking the cell's SOC afterward. The method is based on rigorous theoretical analysis, requires no hyperparameter fine-tuning, and is hundreds of times faster than conventional model-based methods. The method is validated on six batteries charged at different C rates and temperatures, realizing fast and accurate estimations under various conditions, with a SOH root mean square error (RMSE) of around 3% and a SOC RMSE of around 1.5%. The data and codes are available at https://berkeley.box.com/s/jz1w6po2iqzzfy7irxd9ok47ku3tr86j.
Submitted 6 June, 2025; v1 submitted 2 August, 2024;
originally announced August 2024.
-
HINER: Neural Representation for Hyperspectral Image
Authors:
Junqi Shi,
Mingyi Jiang,
Ming Lu,
Tong Chen,
Xun Cao,
Zhan Ma
Abstract:
This paper introduces HINER, a novel neural representation for compressing HSI and ensuring high-quality downstream tasks on compressed HSI. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angle Mapper with the L1 loss, we can supervise the global and local information within each spectral band, thereby enhancing the overall reconstruction quality. For downstream classification on compressed HSI, we theoretically demonstrate that the task accuracy is related not only to the classification loss but also to the reconstruction fidelity through a first-order expansion of the accuracy degradation, and accordingly adapt the reconstruction by introducing Adaptive Spectral Weighting. Owing to the monotonic mapping of HINER between wavelengths and spectral bands, we propose Implicit Spectral Interpolation for data augmentation by adding random variables to input wavelengths during classification model training. Experimental results on various HSI datasets demonstrate the superior compression performance of our HINER compared to the existing learned methods and also the traditional codecs. Our model is lightweight and computationally efficient, and it maintains high accuracy for the downstream classification task even on decoded HSIs at high compression ratios. Our materials will be released at https://github.com/Eric-qi/HINER.
Submitted 31 July, 2024;
originally announced July 2024.
-
A Secure and Efficient Distributed Semantic Communication System for Heterogeneous Internet of Things
Authors:
Weihao Zeng,
Xinyu Xu,
Qianyun Zhang,
Jiting Shi,
Zhenyu Guan,
Shufeng Li,
Zhijin Qin
Abstract:
Semantic communications are expected to improve the transmission efficiency in Internet of Things (IoT) networks. However, the distributed nature of networks and heterogeneity of devices challenge the secure utilization of semantic communication systems. In this paper, we develop a distributed semantic communication system that achieves security and efficiency during the update and usage phases. A blockchain-based trust scheme for updates is designed to continuously train and synchronize the system in dynamic IoT environments. To improve the updating efficiency, we propose a flexible semantic coding method based on compressive semantic knowledge bases. It greatly reduces the amount of data shared among devices for system update, and allows flexible adjustment of the size of knowledge bases and the number of transmitted signal symbols in the model training and inference stages. In the usage phase, a signature mechanism for lossy semantics is introduced to guarantee the integrity and authenticity of the transmitted semantics in lossy semantic communications. We further design a noise-aware differential privacy mechanism, which introduces optimized noise based on the different channel information available to heterogeneous devices. Experiments on text transmission tasks show that the proposed system protects the integrity and privacy of exchanged semantics, and reduces the data to be transmitted in the update phase by about $35\%$ to $88\%$, and in the usage phase by $60\%$, compared with related works.
Submitted 11 December, 2024; v1 submitted 19 July, 2024;
originally announced July 2024.
-
A New Framework for Nonlinear Kalman Filters
Authors:
Shida Jiang,
Junzhe Shi,
Scott Moura
Abstract:
The Kalman filter (KF) is a state estimation algorithm that optimally combines system knowledge and measurements to minimize the mean squared error of the estimated states. While KF was initially designed for linear systems, numerous extensions of it, such as extended Kalman filter (EKF), unscented Kalman filter (UKF), cubature Kalman filter (CKF), etc., have been proposed for nonlinear systems over the last sixty years. Although different types of nonlinear KFs have different pros and cons, they all use the same framework of linear KF. Yet, according to our theoretical and empirical analysis, the framework tends to give overconfident and less accurate state estimations when the measurement functions are nonlinear. Therefore, in this study, we designed a new framework that can be combined with any existing type of nonlinear KFs and showed theoretically and empirically that the new framework estimates the states and covariance more accurately than the old one. The new framework was tested on four different nonlinear KFs and five different tasks, showcasing its ability to reduce estimation errors by several orders of magnitude in low-measurement-noise conditions. The codes are available at https://github.com/Shida-Jiang/A-new-framework-for-nonlinear-Kalman-filters
△ Less
Submitted 19 June, 2025; v1 submitted 8 July, 2024;
originally announced July 2024.
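The abstract above refers to the framework of the linear KF that existing nonlinear variants share. For orientation, a minimal sketch of one predict/update cycle of a scalar linear Kalman filter follows. It illustrates only the standard linear framework, not the new framework the paper proposes; the function name `kf_step`, its parameter names, and the default noise values are assumptions chosen for the example.

```python
def kf_step(x, p, z, a=1.0, h=1.0, q=1e-4, r=0.1):
    """One predict/update cycle of a scalar linear Kalman filter.

    x, p : previous state estimate and its variance
    z    : new measurement
    a, h : state-transition and measurement coefficients (assumed linear)
    q, r : process and measurement noise variances (assumed known)
    """
    # Predict: propagate the estimate and its variance through the model.
    x_pred = a * x
    p_pred = a * p * a + q

    # Update: weight the measurement residual by the Kalman gain.
    innovation_var = h * p_pred * h + r
    gain = p_pred * h / innovation_var
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new


# Hypothetical usage: track a constant value from noisy readings.
x, p = 0.0, 1.0
for z in [0.9, 1.1, 1.0, 0.95]:
    x, p = kf_step(x, p, z)
print(x, p)
```

The variance `p` shrinks with each measurement regardless of how well the linear model fits, which is the kind of behavior the abstract describes as overconfident when the measurement function is actually nonlinear.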
-
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Authors:
Keyu An,
Qian Chen,
Chong Deng,
Zhihao Du,
Changfeng Gao,
Zhifu Gao,
Yue Gu,
Ting He,
Hangrui Hu,
Kai Hu,
Shengpeng Ji,
Yabin Li,
Zerui Li,
Heng Lu,
Haoneng Luo,
Xiang Lv,
Bin Ma,
Ziyang Ma,
Chongjia Ni,
Changhe Song,
Jiaqi Shi,
Xian Shi,
Hao Wang,
Wen Wang,
Yuxuan Wang
, et al. (8 additional authors not shown)
Abstract:
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
Submitted 10 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.