arXiv:2506.19885 [pdf, ps, other]

FlightKooba: A Fast Interpretable FTP Model

Authors: Jing Lu, Xuan Wu, Yizhun Tian, Songhan Fan, Yali Fang

Abstract: The Koopman theory is a powerful and effective modeling tool for converting nonlinear systems into linear representations, and flight trajectory prediction (FTP) is a complex nonlinear system. However, current models applying the Koopman theory to FTP tasks are not very effective, model interpretability is indeed an issue, and the Koopman operators are computationally intensive, resulting in long training times. To address this issue, this paper proposes a new modeling and control framework based on the HIPPO method, the Koopman theory, and state space equations from cybernetics: FlightKooba. Inspired by the idea of structural state space equations, FlightKooba directly constructs the Koopman operators from data. This makes the framework highly interpretable and significantly reduces the number of trainable parameters in the module, thereby greatly reducing training time. Experiments have demonstrated the superiority of the FlightKooba modeling method in terms of time and memory consumption (training time comparable to the Mamba module without using CUDA-level acceleration; memory reduced by more than 50% on most datasets, with a tenfold reduction in the number of parameters), essentially completing the FTP task. It provides a new method for the fast computation of the Koopman operators, opening up new possibilities for the combination of time series forecasting and control.

Submitted 24 June, 2025; originally announced June 2025.

Comments: 7 figures

arXiv:2506.06400 [pdf, ps, other]

ResPF: Residual Poisson Flow for Efficient and Physically Consistent Sparse-View CT Reconstruction

Authors: Changsheng Fang, Yongtong Liu, Bahareh Morovati, Shuo Han, Yu Shi, Li Zhou, Shuyi Fan, Hengyong Yu

Abstract: Sparse-view computed tomography (CT) is a practical solution to reduce radiation dose, but the resulting ill-posed inverse problem poses significant challenges for accurate image reconstruction. Although deep learning and diffusion-based methods have shown promising results, they often lack physical interpretability or suffer from high computational costs due to iterative sampling starting from random noise. Recent advances in generative modeling, particularly Poisson Flow Generative Models (PFGM), enable high-fidelity image synthesis by modeling the full data distribution. In this work, we propose Residual Poisson Flow (ResPF) Generative Models for efficient and accurate sparse-view CT reconstruction. Based on PFGM++, ResPF integrates conditional guidance from sparse measurements and employs a hijacking strategy to significantly reduce sampling cost by skipping redundant initial steps. However, skipping early stages can degrade reconstruction quality and introduce unrealistic structures. To address this, we embed a data-consistency into each iteration, ensuring fidelity to sparse-view measurements. Yet, PFGM sampling relies on a fixed ordinary differential equation (ODE) trajectory induced by electrostatic fields, which can be disrupted by step-wise data consistency, resulting in unstable or degraded reconstructions. Inspired by ResNet, we introduce a residual fusion module to linearly combine generative outputs with data-consistent reconstructions, effectively preserving trajectory continuity.
To the best of our knowledge, this is the first application of Poisson flow models to sparse-view CT. Extensive experiments on synthetic and clinical datasets demonstrate that ResPF achieves superior reconstruction quality, faster inference, and stronger robustness compared to state-of-the-art iterative, learning-based, and diffusion models.

Submitted 5 June, 2025; originally announced June 2025.
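
The ResPF abstract above describes a residual fusion module that linearly combines the generative output with the data-consistent reconstruction. A minimal sketch of one such linear combination follows. The function name `residual_fuse`, the scalar weight `alpha`, and the flat-list image representation are all assumptions made for illustration; the abstract does not state how the paper parameterizes the fusion.

```python
def residual_fuse(generative, data_consistent, alpha):
    """Linearly blend two reconstructions of the same image.

    Written in residual form: start from the generative output and add a
    scaled correction toward the data-consistent reconstruction.
    `alpha` is a hypothetical fusion weight in [0, 1].
    """
    if len(generative) != len(data_consistent):
        raise ValueError("inputs must have the same number of pixels")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [g + alpha * (d - g) for g, d in zip(generative, data_consistent)]


# Toy two-pixel "images" (made-up values, for illustration only).
fused = residual_fuse([0.0, 2.0], [1.0, 4.0], alpha=0.5)
print(fused)
```

With `alpha = 0` the generative output passes through unchanged, and with `alpha = 1` the result equals the data-consistent reconstruction; intermediate values trade off the two.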

arXiv:2502.18008 [pdf, other]

NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms

Authors: Yashan Wang, Shangda Wu, Jianhuai Hu, Xingjian Du, Yueqi Peng, Yongxin Huang, Shuai Fan, Xiaobing Li, Feng Yu, Maosong Sun

Abstract: We introduce NotaGen, a symbolic music generation model aiming to explore the potential of producing high-quality classical sheet music. Inspired by the success of Large Language Models (LLMs), NotaGen adopts pre-training, fine-tuning, and reinforcement learning paradigms (henceforth referred to as the LLM training paradigms). It is pre-trained on 1.6M pieces of music in ABC notation, and then fine-tuned on approximately 9K high-quality classical compositions conditioned on "period-composer-instrumentation" prompts. For reinforcement learning, we propose the CLaMP-DPO method, which further enhances generation quality and controllability without requiring human annotations or predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in symbolic music generation models with different architectures and encoding schemes. Furthermore, subjective A/B tests show that NotaGen outperforms baseline models against human compositions, greatly advancing musical aesthetics in symbolic music generation.

Submitted 21 March, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.17499 [pdf]

Detecting Long QT Syndrome and First-Degree Atrioventricular Block using Single-Lead AI-ECG: A Multi-Center Real-World Study

Authors: Sumei Fan, Deyun Zhang, Yue Wang, Shijia Geng, Kun Lu, Meng Sang, Weilun Xu, Haixue Wang, Qinghao Zhao, Chuandong Cheng, Peng Wang, Shenda Hong

Abstract: Home-based single-lead AI-ECG devices have enabled continuous, real-world cardiac monitoring. However, the accuracy of parameter calculations from single-lead AI-ECG algorithms remains to be fully validated, which is critical for conditions such as Long QT Syndrome (LQTS) and First-Degree Atrioventricular Block (AVBI). In this multicenter study, we assessed FeatureDB, an ECG measurements computation algorithm, in the context of single-lead monitoring using three annotated datasets: PTB-XL+ (n=21,354), CSE (n=105), and HeartVoice-ECG-lite (n=369). FeatureDB showed strong correlation with standard ECG machines (12SL and Uni-G) in key measurements (PR, QRS, QT, QTc), and high agreement confirmed by Bland-Altman analysis. In detecting LQTS (AUC=0.786) and AVBI (AUC=0.684), FeatureDB demonstrated diagnostic performance comparable to commercial ECG systems (12SL: 0.859/0.716; Uni-G: 0.817/0.605), significantly outperforming ECGDeli (0.501/0.569). Notably, FeatureDB can operate locally on resource-limited devices, facilitating use in low-connectivity settings. These findings confirm the clinical reliability of FeatureDB for single-lead ECG diagnostics and highlight its potential to bridge traditional ECG diagnostics with wearable technology for scalable cardiovascular monitoring and early intervention.

Submitted 26 April, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

Comments: 29 pages, 11 figures, 8 tables
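
The abstract above reports agreement confirmed by Bland-Altman analysis. As a reminder of what that analysis computes, here is a minimal sketch of the standard procedure: the bias is the mean of the paired differences, and the 95% limits of agreement are the bias plus or minus 1.96 sample standard deviations. The function name and the QT values are invented for illustration and are not taken from the study.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # stdev() is the sample standard deviation
    return bias, bias - spread, bias + spread


# Hypothetical QT intervals in ms (made-up numbers, not study data).
device_qt = [400, 410, 420, 395]
reference_qt = [398, 412, 419, 397]
bias, lower, upper = bland_altman(device_qt, reference_qt)
print(f"bias={bias:.2f} ms, limits of agreement=({lower:.2f}, {upper:.2f}) ms")
```

A bias near zero with narrow limits indicates that the two methods can be used interchangeably for that measurement; what counts as "narrow" is a clinical judgment, not something the calculation decides.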

arXiv:2502.16584 [pdf, other]

Audio-FLAN: A Preliminary Release

Authors: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.

Submitted 23 February, 2025; originally announced February 2025.

arXiv:2501.08868 [pdf, other]

Processing and Analyzing Real-World Driving Data: Insights on Trips, Scenarios, and Human Driving Behaviors

Authors: Jihun Han, Dominik Karbowski, Ayman Moawad, Namdoo Kim, Aymeric Rousseau, Shihong Fan, Jason Hoon Lee, Jinho Ha

Abstract: Analyzing large volumes of real-world driving data is essential for providing meaningful and reliable insights into real-world trips, scenarios, and human driving behaviors. To this end, we developed a multi-level data processing approach that adds new information, segments data, and extracts desired parameters. Leveraging a confidential but extensive dataset (over 1 million km), this approach leads to three levels of in-depth analysis: trip, scenario, and driving. The trip-level analysis explains representative properties observed in real-world trips, while the scenario-level analysis focuses on scenario conditions resulting from road events that reduce vehicle speed. The driving-level analysis identifies the cause of driving regimes for specific situations and characterizes typical human driving behaviors. Such analyses can support the design of both trip- and scenario-based tests, the modeling of human drivers, and the establishment of guidelines for connected and automated vehicles.

Submitted 15 January, 2025; originally announced January 2025.

arXiv:2501.06115 [pdf]

Development of an Advisory System for Parking of a Car and Trailer

Authors: Xincheng Cao, Haochong Chen, Bilin Aksun Guvenc, Levent Guvenc, Shihong Fan, John Harber, Brian Link, Peter Richmond, Dokyung Yim

Abstract: Trailer parking is a challenging task due to the unstable nature of the vehicle-trailer system in reverse motion and the unintuitive steering actions required at the vehicle to accomplish the parking maneuver. This paper presents a strategy to tackle this kind of maneuver with an advisory graphic aid to help the human driver with the task of manually backing up the vehicle-trailer system. A kinematic vehicle-trailer model is derived to describe the low-speed motion of the vehicle-trailer system, and its inverse kinematics is established by generating an equivalent virtual trailer axle steering command. The advisory system graphics is generated based on the inverse kinematics and displays the expected trailer orientation given the current vehicle steer angle and configuration (hitch angle). Simulation study and animation are set up to test the efficacy of the approach, where the user can select both vehicle speed and vehicle steering angle freely, which allows the user to stop the vehicle-trailer system and experiment with different steering inputs to see their effect on the predicted trailer motion before proceeding with the best one according to the advisory graphics, hence creating a series of piecewise continuous control actions similar to how manual trailer reverse parking is usually carried out. The advisory graphics proves to provide the driver with an intuitive understanding of the trailer motion at any given configuration (hitch angle).

Submitted 10 January, 2025; originally announced January 2025.

arXiv:2412.19078 [pdf, other]

Graph-Enhanced Dual-Stream Feature Fusion with Pre-Trained Model for Acoustic Traffic Monitoring

Authors: Shitong Fan, Feiyang Xiao, Wenbo Wang, Shuhan Qi, Qiaoxi Zhu, Wenwu Wang, Jian Guan

Abstract: Microphone array techniques are widely used in sound source localization and smart city acoustic-based traffic monitoring, but these applications face significant challenges due to the scarcity of labeled real-world traffic audio data and the complexity and diversity of application scenarios. The DCASE Challenge's Task 10 focuses on using multi-channel audio signals to count vehicles (cars or commercial vehicles) and identify their directions (left-to-right or vice versa). In this paper, we propose a graph-enhanced dual-stream feature fusion network (GEDF-Net) for acoustic traffic monitoring, which simultaneously considers vehicle type and direction to improve detection. We propose a graph-enhanced dual-stream feature fusion strategy which consists of a vehicle type feature extraction (VTFE) branch, a vehicle direction feature extraction (VDFE) branch, and a frame-level feature fusion module to combine the type and direction feature for enhanced performance. A pre-trained model (PANNs) is used in the VTFE branch to mitigate data scarcity and enhance the type features, followed by a graph attention mechanism to exploit temporal relationships and highlight important audio events within these features. The frame-level fusion of direction and type features enables fine-grained feature representation, resulting in better detection performance. Experiments demonstrate the effectiveness of our proposed method. GEDF-Net is our submission that achieved 1st place in the DCASE 2024 Challenge Task 10.

Submitted 26 December, 2024; originally announced December 2024.

Comments: Shitong Fan and Feiyang Xiao contributed equally. Accepted by the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025

arXiv:2411.13298

A CSI Feedback Framework based on Transmitting the Important Values and Generating the Others

Authors: Zhilin Du, Zhenyu Liu, Haozhen Li, Shilong Fan, Xinyu Gu, Lin Zhang

Abstract: The application of deep learning (DL)-based channel state information (CSI) feedback frameworks in massive multiple-input multiple-output (MIMO) systems has significantly improved reconstruction accuracy. However, the limited generalization of widely adopted autoencoder-based networks for CSI feedback challenges consistent performance under dynamic wireless channel conditions and varying communication overhead constraints. To enhance the robustness of DL-based CSI feedback across diverse channel scenarios, we propose a novel framework, ITUG, where the user equipment (UE) transmits only a selected portion of critical values in the CSI matrix, while a generative model deployed at the BS reconstructs the remaining values. Specifically, we introduce a scoring algorithm to identify important values based on amplitude and contrast, an encoding algorithm to convert these values into a bit stream for transmission using adaptive bit length and a modified Huffman codebook, and a Transformer-based generative network named TPMVNet to recover the untransmitted values based on the received important values. Experimental results demonstrate that the ITUG framework, equipped with a single TPMVNet, achieves superior reconstruction performance compared to several high-performance autoencoder models across various channel conditions.

Submitted 28 November, 2024; v1 submitted 20 November, 2024; originally announced November 2024.

Comments: I have to make some modifications to the test dataset and the contrast methods in the experimental results section

arXiv:2410.15078 [pdf, other]

Independent Feature Enhanced Crossmodal Fusion for Match-Mismatch Classification of Speech Stimulus and EEG Response

Authors: Shitong Fan, Wenbo Wang, Feiyang Xiao, Shiheng Zhang, Qiaoxi Zhu, Jian Guan

Abstract: It is crucial for auditory attention decoding to classify matched and mismatched speech stimuli with corresponding EEG responses by exploring their relationship. However, existing methods often adopt two independent networks to encode speech stimulus and EEG response, which neglect the relationship between these signals from the two modalities. In this paper, we propose an independent feature enhanced crossmodal fusion model (IFE-CF) for match-mismatch classification, which leverages the fusion feature of the speech stimulus and the EEG response to achieve auditory EEG decoding. Specifically, our IFE-CF contains a crossmodal encoder to encode the speech stimulus and the EEG response with a two-branch structure connected via crossmodal attention mechanism in the encoding process, a multi-channel fusion module to fuse features of two modalities by aggregating the interaction feature obtained from the crossmodal encoder and the independent feature obtained from the speech stimulus and EEG response, and a predictor to give the matching result. In addition, the causal mask is introduced to consider the time delay of the speech-EEG pair in the crossmodal encoder, which further enhances the feature representation for match-mismatch classification. Experiments demonstrate our method's effectiveness with better classification accuracy, as compared with the baseline of the Auditory EEG Decoding Challenge 2023.

Submitted 19 October, 2024; originally announced October 2024.

Comments: Shitong Fan and Wenbo Wang contributed equally. Accepted by the International Symposium on Chinese Spoken Language Processing (ISCSLP) 2024

arXiv:2410.13992 [pdf]

Resilience-Oriented DG Siting and Sizing Considering Energy Equity Constraint

Authors: Chenchen Li, Fangxing Li, Sufan Jiang, Jin Zhao, Shiyuan Fan, Leon M. Tolbert

Abstract: Extreme weather events can cause widespread power outages and huge economic losses. Low-income customers are more vulnerable to power outages because they live in areas with poorly equipped distribution systems. However, existing approaches to improve grid resilience focus on the overall condition of the system and ignore the outage experiences of low-income customers, which leads to significant energy inequities in resilience. Therefore, this paper explores a new resilience-oriented planning method for distributed generator (DG) siting and sizing, by embedding an additional energy equity constraint (EEC). First, the expected load shedding index (ELSI) is defined as the ratio of the load shedding to the original load, which quantifies the resilience-oriented energy equity. Then, the DG siting and sizing problem is formulated as a two-stage stochastic programming with the EEC. The first stage determines the optimal sites and sizes of DG units under investment constraints and EECs, while the second stage optimizes expected costs of unserved load. A subsidiary variable is introduced to ensure the model's solvability. Finally, numerical studies are performed on the IEEE 33-bus and 123-bus systems to verify the effectiveness of the proposed DG planning model in achieving energy equity. Three observations are presented as future guidelines for resilience-oriented DG planning.

Submitted 17 October, 2024; originally announced October 2024.
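
The abstract above defines the expected load shedding index (ELSI) as the ratio of the load shedding to the original load. A minimal sketch of that ratio follows, assuming the expectation is a probability-weighted sum over outage scenarios. The scenario weighting, the function name, and the numbers are assumptions for illustration; the abstract does not say how the paper aggregates over scenarios or customer groups.

```python
def expected_load_shedding_index(shed_by_scenario, probabilities, original_load):
    """Expected load shed divided by the original load.

    shed_by_scenario: load shed in each outage scenario (e.g. in kW)
    probabilities:    hypothetical scenario probabilities, summing to 1
    original_load:    load before any shedding, in the same unit
    """
    if len(shed_by_scenario) != len(probabilities):
        raise ValueError("one probability is needed per scenario")
    if original_load <= 0:
        raise ValueError("original_load must be positive")
    expected_shed = sum(p * s for p, s in zip(probabilities, shed_by_scenario))
    return expected_shed / original_load


# Two made-up scenarios: no outage, or 50 kW shed out of a 100 kW load.
print(expected_load_shedding_index([0.0, 50.0], [0.5, 0.5], 100.0))
```

Because the index is normalized by the original load, it can be compared across customer groups of different sizes, which is what an equity constraint on it would need.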

arXiv:2409.11623 [pdf]

doi 10.1080/15472450.2023.2186229

A novel pedestrian road crossing simulator for dynamic traffic light scheduling systems

Authors: Dayuan Tan, Mohamed Younis, Wassila Lalouani, Shuyao Fan, Guozhi Song

Abstract: The major advances in intelligent transportation systems are pushing societal services toward autonomy where road management is to be more agile in order to cope with changes and continue to yield optimal performance. However, the pedestrian experience is not sufficiently considered. Particularly, signalized intersections are expected to be popular if not dominant in urban settings where pedestrian density is high. This paper presents the design of a novel environment for simulating human motion on signalized crosswalks at a fine-grained level. Such a simulation not only captures typical behavior, but also handles cases where large pedestrian groups cross from both directions. The proposed simulator is instrumental for optimized road configuration management where the pedestrians' quality of experience, for example, waiting time, is factored in. The validation results using field data show that an accuracy of 98.37 percent can be obtained for the estimated crossing time. Other results using synthetic data show that our simulator enables optimized traffic light scheduling that diminishes pedestrians' waiting time without sacrificing vehicular throughput.

Submitted 17 September, 2024; originally announced September 2024.

Journal ref: Journal of Intelligent Transportation Systems 28.5 (2024): 636-650

arXiv:2407.14904 [pdf, other]

Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning

Authors: Chen Shen, Chunfeng Lian, Wanqing Zhang, Fan Wang, Jianhua Zhang, Shuanliang Fan, Xin Wei, Gongji Wang, Kehan Li, Hongshu Mu, Hao Wu, Xinggong Liang, Jianhua Ma, Zhenyuan Wang

Abstract: Forensic pathology is critical in determining the cause and manner of death through post-mortem examinations, both macroscopic and microscopic. The field, however, grapples with issues such as outcome variability, laborious processes, and a scarcity of trained professionals. This paper presents SongCi, an innovative visual-language model (VLM) designed specifically for forensic pathology. SongCi utilizes advanced prototypical cross-modal self-supervised contrastive learning to enhance the accuracy, efficiency, and generalizability of forensic analyses. It was pre-trained and evaluated on a comprehensive multi-center dataset, which includes over 16 million high-resolution image patches, 2,228 vision-language pairs of post-mortem whole slide images (WSIs), and corresponding gross key findings, along with 471 distinct diagnostic outcomes. Our findings indicate that SongCi surpasses existing multi-modal AI models in many forensic pathology tasks, performs comparably to experienced forensic pathologists and significantly better than less experienced ones, and provides detailed multi-modal explainability, offering critical assistance in forensic investigations. To the best of our knowledge, SongCi is the first VLM specifically developed for forensic pathological analysis and the first large-vocabulary computational pathology (CPath) model that directly processes gigapixel WSIs in forensic science.

Submitted 20 July, 2024; originally announced July 2024.

Comments: 28 pages, 6 figures, under review

arXiv:2406.11546 [pdf, other]

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thereby enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to Whisper large-v3, with merely 10% model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services.
We hope that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.

Submitted 27 May, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted in ACL 2025 (Main)

arXiv:2405.19665 [pdf]

A novel fault localization with data refinement for hydroelectric units

Authors: Jialong Huang, Junlin Song, Penglong Lian, Mengjie Gan, Zhiheng Su, Benhao Wang, Wenji Zhu, Xiaomin Pu, Jianxiao Zou, Shicai Fan

Abstract: Due to the scarcity of fault samples and the complexity of the non-linear and non-smooth data in hydroelectric units, most traditional hydroelectric unit fault localization methods struggle to localize faults accurately. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)-manifold-boosted deep learning (SG-WMBDL) based fault localization method for hydroelectric units is proposed. To overcome the data scarcity, an SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Because the signals have non-linear and non-smooth characteristics, an improved WNR, which combines soft and hard thresholding, and local linear embedding (LLE) are used in the data preprocessing module to reduce the noise and effectively capture the local features. In addition, to seek higher performance, a novel Adaptive Boost (AdaBoost) combined with multi deep learning is proposed to achieve accurate fault localization. The experimental results show that SG-WMBDL can locate faults for hydroelectric units from a small number of fault samples with non-linear and non-smooth characteristics, with higher precision and accuracy than other frontier methods, which verifies the effectiveness and practicality of the proposed method.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 6 pages, 4 figures, Conference on Decision and Control (CDC)

arXiv:2404.15339 [pdf, other]

Efficient EndoNeRF Reconstruction and Its Application for Data-driven Surgical Simulation

Authors: Yuehao Wang, Bingchen Gong, Yonghao Long, Siu Hin Fan, Qi Dou

Abstract: The healthcare industry has a growing need for realistic modeling and efficient simulation of surgical scenes. With effective models of deformable surgical scenes, clinicians are able to conduct surgical planning and surgery training on scenarios close to real-world cases. However, a significant challenge in achieving such a goal is the scarcity of high-quality soft tissue models with accurate shapes and textures. To address this gap, we present a data-driven framework that leverages emerging neural radiance field technology to enable high-quality surgical reconstruction and explore its application for surgical simulations. We first focus on developing a fast NeRF-based surgical scene 3D reconstruction approach that achieves state-of-the-art performance. This method can significantly outperform traditional 3D reconstruction methods, which have failed to capture large deformations and produce fine-grained shapes and textures. We then propose an automated creation pipeline of interactive surgical simulation environments through a closed mesh extraction algorithm. Our experiments have validated the superior performance and efficiency of our proposed approach in surgical scene 3D reconstruction. We further utilize our reconstructed soft tissues to conduct FEM and MPM simulations, showcasing the practical application of our method in data-driven surgical simulations.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 14 pages, 4 figures. Accepted by International Journal of Computer Assisted Radiology and Surgery

arXiv:2404.06079 [pdf, other]

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Authors: Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu

Abstract: Discrete speech tokens have become increasingly popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. Notably, we achieved 1st rank on the leaderboard in the TTS track both with the whole training set and with only 1h of training data, with the highest UTMOS score and lowest bitrate among all submissions.

Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: 5 pages, 3 figures. Report of a challenge

arXiv:2403.19185 [pdf, other]

Deep CSI Compression for Dual-Polarized Massive MIMO Channels with Disentangled Representation Learning

Authors: Suhang Fan, Wei Xu, Renjie Xie, Shi Jin, Derrick Wing Kwan Ng, Naofal Al-Dhahir

Abstract: Channel state information (CSI) feedback is critical for achieving the promised advantages of enhancing spectral and energy efficiencies in massive multiple-input multiple-output (MIMO) wireless communication systems. Deep learning (DL)-based methods have been proven effective in reducing the required signaling overhead for CSI feedback. In practical dual-polarized MIMO scenarios, channels in the vertical and horizontal polarization directions tend to exhibit high polarization correlation. To fully exploit the inherent propagation similarity within dual-polarized channels, we propose a disentangled representation neural network (NN) for CSI feedback, referred to as DiReNet. The proposed DiReNet disentangles dual-polarized CSI into three components: polarization-shared information, vertical polarization-specific information, and horizontal polarization-specific information. This disentanglement of dual-polarized CSI enables the minimization of information redundancy caused by the polarization correlation and improves the performance of CSI compression and recovery. Additionally, flexible quantization and network extension schemes are designed. Consequently, our method provides a pragmatic solution for CSI feedback to harness the physical MIMO polarization as a priori information. Our experimental results show that the performance of our proposed DiReNet surpasses that of existing DL-based networks, while also effectively reducing the number of network parameters by nearly one third.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2401.08926 [pdf, ps, other]

Stochasticity-aware No-Reference Point Cloud Quality Assessment

Authors: Songlin Fan, Wei Gao, Zhineng Chen, Ge Li, Guoqing Liu, Qicheng Wang

Abstract: The evolution of point cloud processing algorithms necessitates an accurate assessment for their quality. Previous works consistently regard point cloud quality assessment (PCQA) as a MOS regression problem and devise a deterministic mapping, ignoring the stochasticity in generating MOS from subjective tests. This work presents the first probabilistic architecture for no-reference PCQA, motivated by the labeling process of existing datasets. The proposed method can model the quality judging stochasticity of subjects through a tailored conditional variational autoencoder (CVAE) and produces multiple intermediate quality ratings. These intermediate ratings simulate the judgments from different subjects and are then integrated into an accurate quality prediction, mimicking the generation process of a ground truth MOS. Specifically, our method incorporates a Prior Module, a Posterior Module, and a Quality Rating Generator, where the former two modules are introduced to model the judging stochasticity in subjective tests, while the latter is developed to generate diverse quality ratings. Extensive experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits gratifying cross-dataset robustness. Codes are available at https://git.openi.org.cn/OpenPointCloud/nrpcqa.

Submitted 15 June, 2025; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted to IJCAI 2025

arXiv:2310.04992 [pdf, other]

doi 10.1056/AIoa2300221

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Authors: Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, Yuyang Zhao, Xuehui Shi, Junfang Xian, Xiaoxia Qu, Sirui Zhu, Lijie Pan, Xiaoniao Chen, Xiaojia Zhang, Shuai Jiang, Kebing Wang, Chenlong Yang, Mingqiang Chen, Sujie Fan, Jianhua Hu, Aiguo Lv , et al. (17 additional authors not shown)

Abstract: We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation.

Submitted 7 October, 2023; originally announced October 2023.

Journal ref: The latest VisionFM work has been published in NEJM AI, 2024

arXiv:2309.15529 [pdf]

Missing-modality Enabled Multi-modal Fusion Architecture for Medical Data

Authors: Muyu Wang, Shiyu Fan, Yichen Li, Hui Chen

Abstract: Fusing multi-modal data can improve the performance of deep learning models. However, missing modalities are common for medical data due to patients' specificity, which is detrimental to the performance of multi-modal models in applications. Therefore, it is critical to adapt the models to missing modalities. This study aimed to develop an efficient multi-modal fusion architecture for medical data that was robust to missing modalities and further improved the performance on disease diagnosis. X-ray chest radiographs for the image modality, radiology reports for the text modality, and structured value data for the tabular data modality were fused in this study. Each modality pair was fused with a Transformer-based bi-modal fusion module, and the three bi-modal fusion modules were then combined into a tri-modal fusion framework. Additionally, multivariate loss functions were introduced into the training process to improve the model's robustness to missing modalities in the inference process. Finally, we designed comparison and ablation experiments to validate the effectiveness of the fusion, the robustness to missing modalities and the enhancements from each key component. Experiments were conducted on MIMIC-IV and MIMIC-CXR with the 14-label disease diagnosis task. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to evaluate the models' performance. The experimental results demonstrated that our proposed multi-modal fusion architecture effectively fused three modalities and showed strong robustness to missing modalities. This method can potentially be scaled to more modalities to enhance the clinical practicality of the model.

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2305.03250 [pdf, other]

Experimentally Realizing Convolution Processing in the Photonic Synthetic Frequency Dimension

Authors: Lingling Fan, Kai Wang, Heming Wang, Avik Dutt, Shanhui Fan

Abstract: Convolution is an essential operation in signal and image processing and consumes most of the computing power in convolutional neural networks. Photonic convolution has the promise of addressing computational bottlenecks and outperforming electronic implementations. Performing photonic convolution in the synthetic frequency dimension, which harnesses the dynamics of light in the spectral degrees of freedom for photons, can lead to highly compact devices. Here we experimentally realize convolution operations in the synthetic frequency dimension. Using a modulated ring resonator, we synthesize arbitrary convolution kernels using a pre-determined modulation waveform with high accuracy. We demonstrate the convolution computation between input frequency combs and synthesized kernels. We also introduce the idea of an additive offset to broaden the kinds of kernels that can be implemented experimentally when the modulation strength is limited. Our work demonstrates the use of the synthetic frequency dimension to efficiently encode data and implement computation tasks, leading to a compact and scalable photonic computation architecture.

Submitted 11 August, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: Science Advances, in press

arXiv:2301.03331 [pdf, other]

doi 10.1109/TPWRD.2023.3337274

A Specific Task-oriented Semantic Image Communication System for substation patrol inspection

Authors: Senran Fan, Haotai Liang, Chen Dong, Xiaodong Xu, Geng Liu

Abstract: Intelligent inspection robots are widely used in substation patrol inspection, where they help check for potential safety hazards by patrolling the substation and sending back scene images. However, when patrolling some marginal areas with weak signal, the scene images cannot be successfully transmitted for use in hidden danger elimination, which greatly reduces the quality of the robots' daily work. To solve this problem, a Specific Task-oriented Semantic Communication System for Image (STSCI) is designed, which involves semantic feature extraction, transmission, restoration and enhancement to obtain clearer images sent by intelligent robots under weak signals. Motivated by the fact that only some specific details of the image are needed in the substation patrol inspection task, we propose a new paradigm of semantic enhancement for this specific task to ensure the clarity of key semantic information at a lower bit rate or a low signal-to-noise ratio. In reality-based simulation, experiments show that STSCI generally surpasses traditional image-compression-based, channel-coding-based, and other semantic communication systems in the substation patrol inspection task at a lower bit rate, even under a low signal-to-noise ratio.

Submitted 13 April, 2024; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: 9 pages, 8 figures

Journal ref: IEEE Transactions on Power Delivery; vol. 39; no. 2; pp. 835-844; April 2024

arXiv:2210.16935 [pdf, other]

Scalable and self-correcting photonic computation using balanced photonic binary tree cascades

Authors: Sunil Pai, Olav Solgaard, Shanhui Fan, David A. B. Miller

Abstract: Programmable unitary photonic networks that interfere hundreds of modes are emerging as a key technology in energy-efficient sensing, machine learning, cryptography, and linear optical quantum computing applications. In this work, we establish a theoretical framework to quantify error tolerance and scalability in a more general class of "binary tree cascade" programmable photonic networks that accept up to tens of thousands of discrete input modes $N$. To justify this scalability claim, we derive error tolerance and configuration time that scale with $\log_2 N$ for balanced trees versus $N$ in unbalanced trees, despite the same number of total components. Specifically, we use second-order perturbation theory to compute phase sensitivity in each waveguide of balanced and unbalanced networks, and we compute the statistics of the sensitivity given random input vectors. We also evaluate such networks after they self-correct, or self-configure, themselves for errors in the circuit due to fabrication error and environmental drift. Our findings have important implications for scaling photonic circuits to much larger circuit sizes; this scaling is particularly critical for applications such as principal component analysis and fast Fourier transforms, which are important algorithms for machine learning and signal processing.

Submitted 30 October, 2022; originally announced October 2022.

Comments: 32 pages, 12 figures

arXiv:2202.09361 [pdf, other]

doi 10.1109/TAES.2023.3267761

Parameter Identification of a PN-Guided Incoming Missile Using an Improved Multiple-Model Mechanism

Authors: Yinhan Wang, Jiang Wang, Shipeng Fan

Abstract: An active defense against an incoming missile requires information about it, including a guidance law parameter and a first-order lateral time constant. To this end, assuming that a missile with a proportional navigation (PN) guidance law attempts to attack an aerial target with bang-bang evasive maneuvers, a parameter identification model based on the gated recurrent unit (GRU) neural network is built in this paper. The analytic identification solutions for the guidance law parameter and the first-order lateral time constant are derived. The inputs of the identification model are available kinematic information between the aircraft and the missile, while the outputs contain the regression results of missile parameters. To increase the training speed and the identification accuracy of the model, an output processing method called improved multiple-model mechanism (IMMM) is proposed in this paper. The effectiveness of IMMM and the performance of the established model are demonstrated through numerical simulations under various engagement scenarios.

Submitted 25 January, 2022; originally announced February 2022.

Comments: 9 pages, 10 figures

arXiv:2011.07210 [pdf, other]

Rate Splitting Multiple Access for Joint Communication and Sensing Systems with Unmanned Aerial Vehicles

Authors: Yuwei Li, Wanli Ni, Hui Tian, Meihui Hua, Shaoshuai Fan

Abstract: This paper investigates the problem of resource allocation for a joint communication and radar sensing system on a rate-splitting multiple access (RSMA) based unmanned aerial vehicle (UAV) system. The UAV simultaneously communicates with multiple users and probes signals to targets of interest to exploit cooperative sensing ability and achieve substantial gains in size, cost and power consumption. By virtue of using linearly precoded rate splitting at the transmitter and successive interference cancellation at the receivers, RSMA is introduced as a promising paradigm to manage interference as well as enhance spectrum and energy efficiency. To maximize the energy efficiency of UAV networks, the deployment location and the beamforming matrix are jointly optimized under the constraints of power budget, transmission rate and approximation error. To solve the formulated non-convex problem efficiently, we decompose it into the UAV deployment subproblem and the beamforming optimization subproblem. Then, we invoke the successive convex approximation and difference-of-convex programming as well as Dinkelbach methods to transform the intractable subproblems into convex ones at each iteration. Next, an alternating algorithm is designed to solve the non-linear and non-convex problem in an efficient manner, while the corresponding complexity is analyzed as well. Finally, simulation results reveal that the proposed algorithm with RSMA is superior to orthogonal multiple access and power-domain non-orthogonal multiple access in terms of power consumption and energy efficiency.

Submitted 12 July, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

arXiv:2004.05505 [pdf, ps, other]

doi 10.1109/TII.2020.2985723

Data Age Aware Scheduling for Wireless Powered Mobile-Edge Computing in Industrial Internet of Things

Authors: Hao Wu, Hui Tian, Shaoshuai Fan, Jiazhi Ren

Abstract: Wireless powered mobile edge computing has been envisioned as a promising paradigm to enhance the computation capability of low-power wireless devices in Industrial Internet of Things. An efficient resource scheduling method is critical yet challenging to design in such a scenario due to stochastic traffic arrival, time-coupling uplink/downlink decision and incomplete system state knowledge. To tackle these challenges, an online optimization algorithm is proposed in this paper to maximize long-term system utility balancing throughput and fairness, subject to data age and stability constraints. A set of virtual queues is designed to transform the scheduling task, which is hard to solve due to time-dependent data age constraints, into a stochastic optimization problem. Leveraging Lyapunov and convex optimization techniques, the proposed approach can achieve asymptotically near-optimal online decisions without any prior statistical knowledge, and maintain the asymptotic optimality in the presence of partial and outdated network state information. Numerical simulations corroborate the theoretical analysis and demonstrate the effectiveness of the proposed approach.

Submitted 26 April, 2020; v1 submitted 11 April, 2020; originally announced April 2020.

Comments: 21 pages, 4 figures, submitted to IEEE Transactions on Industrial Informatics

arXiv:1903.04579 [pdf, other]

doi 10.1109/JSTQE.2019.2930455

Reprogrammable Electro-Optic Nonlinear Activation Functions for Optical Neural Networks

Authors: Ian A. D. Williamson, Tyler W. Hughes, Momchil Minkov, Ben Bartlett, Sunil Pai, Shanhui Fan

Abstract: We introduce an electro-optic hardware platform for nonlinear activation functions in optical neural networks. The optical-to-optical nonlinearity operates by converting a small portion of the input optical signal into an analog electric signal, which is used to intensity-modulate the original optical signal with no reduction in processing speed. Our scheme allows for complete nonlinear on-off contrast in transmission at relatively low optical power thresholds and eliminates the requirement of having additional optical sources between each layer of the network. Moreover, the activation function is reconfigurable via electrical bias, allowing it to be programmed or trained to synthesize a variety of nonlinear responses. Using numerical simulations, we demonstrate that this activation function significantly improves the expressiveness of optical neural networks, allowing them to perform well on two benchmark machine learning tasks: learning a multi-input exclusive-OR (XOR) logic function and classification of images of handwritten numbers from the MNIST dataset. The addition of the nonlinear activation function improves test accuracy on the MNIST task from 85% to 94%.

Submitted 22 July, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

Comments: 12 pages, 6 figures

Journal ref: IEEE Journal of Selected Topics in Quantum Electronics, vol. 26, no. 1, pp. 1-12, Jan. 2020

arXiv:1901.08118 [pdf]

Imaging-free object recognition enabled by optical coherence

Authors: Yixuan Tan, Xin Lei, Xingze Wang, Shanhui Fan, Zongfu Yu

Abstract: Visual object recognition is one of the most important perception functions for a wide range of intelligent machines. A conventional recognition process begins with forming a clear optical image of the object, followed by its computer analysis. In contrast, it is possible to carry out recognition without imaging by using coherent illumination and directly analyzing the optical interference pattern of the scattered light as captured by an image sensor. Here we show that such direct visual recognition can overcome traditional limitations of imaging optics to realize excellent recognition without focusing, beyond diffraction limit, or in the absence of direct line-of-sight.

Submitted 23 January, 2019; originally announced January 2019.

arXiv:1710.10384 [pdf]

doi 10.1364/OE.25.033534

Single wavelength 480 Gb/s direct detection over 80km SSMF enabled by Stokes Vector Kramers Kronig transceiver

Authors: Thang Hoang, Mohammed Sowailem, Qunbi Zhuge, Zhenping Xing, Mohamed Morsy-Osman, Eslam El-Fiky, Sujie Fan, Meng Xiang, David V. Plant

Abstract: We propose 4D modulation with direct detection employing a novel Stokes-Vector Kramers-Kronig transceiver. We show that employing a Stokes vector receiver, a transmitted digital carrier and Kramers-Kronig detection offers an effective way to de-rotate a polarization-multiplexed complex double-sideband signal without using a local oscillator at the receiver. The impact of system parameters and configurations including carrier-to-signal-power ratio, guard band of the digital carrier, oversampling ratio and real MIMO is experimentally investigated. Finally, a record 480 Gb/s data rate over 80 km SSMF is achieved in a 60 Gbaud PDM-16QAM single carrier experiment with a BER below the threshold of $2.0\times10^{-2}$.

Submitted 27 October, 2017; originally announced October 2017.

Showing 1–30 of 30 results for author: Fan, S