Acoustic Volume Rendering for Neural Impulse Response Fields

Authors: Zitong Lan, Chenhao Zheng, Zhiwei Zheng, Mingmin Zhao

Abstract: Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of impulse response (IR), which characterizes how sound propagates in one scene along different paths before arriving at the listener's position. In this paper, we present Acoustic Volume Rendering (AVR), a novel approach that adapts volume rendering techniques to model acoustic impulse responses. While volume rendering has been successful in modeling radiance fields for images and neural scene representations, IRs present unique challenges as time-series signals. To address these challenges, we introduce frequency-domain volume rendering and use spherical integration to fit the IR measurements. Our method constructs an impulse response field that inherently encodes wave propagation principles and achieves state-of-the-art performance in synthesizing impulse responses for novel poses. Experiments show that AVR surpasses current leading methods by a substantial margin. Additionally, we develop an acoustic simulation platform, AcoustiX, which provides more accurate and realistic IR simulations than existing simulators. Code for AVR and AcoustiX is available at https://zitonglan.github.io/avr.

Submitted 9 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024 Spotlight

arXiv:2407.04675

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li , et al. (30 additional authors not shown)

Abstract: A modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matching scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2210.16849

TT-Net: Dual-path transformer based sound field translation in the spherical harmonic domain

Authors: Yiwen Wang, Zijian Lan, Xihong Wu, Tianshu Qu

Abstract: In current methods for sound field translation tasks based on spherical harmonic (SH) analysis, the solution based on the additive theorem usually faces the problem of singular values caused by large matrix condition numbers. The influence of different distances and frequencies of the spherical radial function on the stability of the translation matrix affects the accuracy of the SH coefficients at the selected point. To address these problems, we propose a neural network scheme based on the dual-path transformer. More specifically, the dual-path network is constructed by the self-attention module along the two dimensions of the frequency and order axes. The transform-average-concatenate layer and upscaling layer are introduced in the network, which provides solutions for multiple sampling points and upscaling. Numerical simulation results indicate that both the working frequency range and the distance range of the translation are extended. More accurate higher-order SH coefficients are obtained with the proposed dual-path network.

Submitted 30 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2210.02166

Robust Bayesian Inference for Moving Horizon Estimation

Authors: Wenhan Cao, Chang Liu, Zhiqian Lan, Shengbo Eben Li, Wei Pan, Angelo Alessandri

Abstract: The accuracy of moving horizon estimation (MHE) suffers significantly in the presence of measurement outliers. Existing methods address this issue by treating measurements leading to large MHE cost function values as outliers, which are subsequently discarded. This strategy, achieved through solving combinatorial optimization problems, is confined to linear systems to guarantee computational tractability and stability. In contrast to these heuristic solutions, our work reexamines MHE from a Bayesian perspective and unveils the fundamental issue behind its lack of robustness: MHE's sensitivity to outliers results from its reliance on the Kullback-Leibler (KL) divergence, where both outliers and inliers are equally considered. To tackle this problem, we propose a robust Bayesian inference framework for MHE, integrating a robust divergence measure to reduce the impact of outliers. In particular, the proposed approach prioritizes the fitting of uncontaminated data and lowers the weight of contaminated ones, instead of directly discarding all potentially contaminated measurements, which may lead to undesirable removal of uncontaminated data. A tuning parameter is incorporated into the framework to adjust the robustness degree to outliers. Notably, the classical MHE can be interpreted as a special case of the proposed approach as the parameter converges to zero.
In addition, our method involves only minor modification to the classical MHE stage cost, thus avoiding the high computational complexity associated with previous outlier-robust methods, and is inherently suitable for nonlinear systems. Most importantly, our method provides robustness and stability guarantees, which are often missing in other outlier-robust Bayes filters. The effectiveness of the proposed method is demonstrated on simulations subject to outliers following different distributions, as well as on physical experiment data.

Submitted 2 October, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: 17 pages

arXiv:2112.10894

doi 10.1109/CW52790.2021.00041

Subject-Independent Drowsiness Recognition from Single-Channel EEG with an Interpretable CNN-LSTM model

Authors: Jian Cui, Zirui Lan, Tianhu Zheng, Yisi Liu, Olga Sourina, Lipo Wang, Wolfgang Müller-Wittig

Abstract: For EEG-based drowsiness recognition, it is desirable to use subject-independent recognition since conducting calibration on each subject is time-consuming. In this paper, we propose a novel Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) model for subject-independent drowsiness recognition from single-channel EEG signals. Different from existing deep learning models that are mostly treated as black-box classifiers, the proposed model can explain its decisions for each input sample by revealing which parts of the sample contain important features identified by the model for classification. This is achieved by a visualization technique that takes advantage of the hidden states output by the LSTM layer. Results show that the model achieves an average accuracy of 72.97% on 11 subjects for leave-one-out subject-independent drowsiness recognition on a public dataset, which is higher than the conventional baseline methods of 55.42%-69.27%, and state-of-the-art deep learning methods. Visualization results show that the model has discovered meaningful patterns of EEG signals related to different mental states across different subjects.

Submitted 21 November, 2021; originally announced December 2021.

Journal ref: 2021 International Conference on Cyberworlds (CW), 2021, pp. 201-208

arXiv:2107.09507

doi 10.1109/TNNLS.2022.3147208

EEG-based Cross-Subject Driver Drowsiness Recognition with an Interpretable Convolutional Neural Network

Authors: Jian Cui, Zirui Lan, Olga Sourina, Wolfgang Müller-Wittig

Abstract: In the context of electroencephalogram (EEG)-based driver drowsiness recognition, it is still challenging to design a calibration-free system, since EEG signals vary significantly among different subjects and recording sessions. Many efforts have been made to use deep learning methods for mental state recognition from EEG signals. However, existing work mostly treats deep learning models as black-box classifiers, while what has been learned by the models and to what extent they are affected by the noise in EEG data are still underexplored. In this paper, we develop a novel convolutional neural network combined with an interpretation technique that allows sample-wise analysis of important features for classification. The network has a compact structure and takes advantage of separable convolutions to process the EEG signals in a spatial-temporal sequence. Results show that the model achieves an average accuracy of 78.35% on 11 subjects for leave-one-out cross-subject drowsiness recognition, which is higher than the conventional baseline methods of 53.40%-72.68% and state-of-the-art deep learning methods of 71.75%-75.19%. Interpretation results indicate the model has learned to recognize biologically meaningful features from EEG signals, e.g., Alpha spindles, as strong indicators of drowsiness across different subjects. In addition, we explore reasons behind some wrongly classified samples with the interpretation technique and discuss potential ways to improve the recognition accuracy.
Our work illustrates a promising direction for using interpretable deep learning models to discover meaningful patterns related to different mental states from complex EEG signals.

Submitted 17 February, 2022; v1 submitted 30 May, 2021; originally announced July 2021.

Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 2022

arXiv:2106.00613

doi 10.1016/j.ymeth.2021.04.017

A Compact and Interpretable Convolutional Neural Network for Cross-Subject Driver Drowsiness Detection from Single-Channel EEG

Authors: Jian Cui, Zirui Lan, Yisi Liu, Ruilin Li, Fan Li, Olga Sourina, Wolfgang Mueller-Wittig

Abstract: Driver drowsiness is one of the main factors leading to road fatalities and hazards in the transportation industry. Electroencephalography (EEG) has been considered one of the best physiological signals for detecting drivers' drowsy states, since it directly measures neurophysiological activities in the brain. However, designing a calibration-free system for driver drowsiness detection with EEG is still a challenging task, as EEG suffers from serious mental and physical drifts across different subjects. In this paper, we propose a compact and interpretable Convolutional Neural Network (CNN) to discover shared EEG features across different subjects for driver drowsiness detection. We incorporate the Global Average Pooling (GAP) layer in the model structure, allowing the Class Activation Map (CAM) method to be used for localizing regions of the input signal that contribute most to classification. Results show that the proposed model can achieve an average accuracy of 73.22% on 11 subjects for 2-class cross-subject EEG signal classification, which is higher than conventional machine learning methods and other state-of-the-art deep learning methods. The visualization technique reveals that the model has learned biologically explainable features, e.g., Alpha spindles and Theta burst, as evidence for the drowsy state. It is also interesting to see that the model uses artifacts that usually dominate the wakeful EEG, e.g., muscle artifacts and sensor drifts, to recognize the alert state.
The proposed model illustrates a potential direction for using CNN models as a powerful tool to discover shared features related to different mental states across different subjects from EEG signals.

Submitted 30 May, 2021; originally announced June 2021.

arXiv:2101.08074

Flocking and Collision Avoidance for a Dynamic Squad of Fixed-Wing UAVs Using Deep Reinforcement Learning

Authors: Chao Yan, Xiaojia Xiang, Chang Wang, Zhen Lan

Abstract: Developing the flocking behavior for a dynamic squad of fixed-wing UAVs is still a challenge due to kinematic complexity and environmental uncertainty. In this paper, we deal with the decentralized flocking and collision avoidance problem through deep reinforcement learning (DRL). Specifically, we formulate a decentralized DRL-based decision making framework from the perspective of every follower, where a collision avoidance mechanism is integrated into the flocking controller. Then, we propose a novel reinforcement learning algorithm PS-CACER for training a shared control policy for all the followers. Besides, we design a plug-n-play embedding module based on convolutional neural networks and the attention mechanism. As a result, the variable-length system state can be encoded into a fixed-length embedding vector, which makes the learned DRL policy independent of the number and the order of followers. Finally, numerical simulation results demonstrate the effectiveness of the proposed method, and the learned policies can be directly transferred to semi-physical simulation without any parameter finetuning.

Submitted 22 July, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

Comments: Accepted for publication in the proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)

arXiv:2003.02436

Talking-Heads Attention

Authors: Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou

Abstract: We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.

Submitted 5 March, 2020; originally announced March 2020.

Showing 1–9 of 9 results for author: Lan, Z