-
A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis
Authors:
Hui Wei,
Dong Yoon Lee,
Shubham Rohal,
Zhizhang Hu,
Shiwei Fang,
Shijia Pan
Abstract:
Foundation models have gained growing interest in the IoT domain due to their reduced reliance on labeled data and strong generalizability across tasks, which address key limitations of traditional machine learning approaches. However, most existing foundation-model-based methods are developed for specific IoT tasks, making it difficult to compare approaches across IoT domains and limiting guidance for applying them to new tasks. This survey aims to bridge this gap by providing a comprehensive overview of current methodologies and organizing them around four performance objectives shared across domains: efficiency, context-awareness, safety, and security & privacy. For each objective, we review representative works and summarize commonly used techniques and evaluation metrics. This objective-centric organization enables meaningful cross-domain comparisons and offers practical insights for selecting and designing foundation-model-based solutions for new IoT tasks. We conclude with key directions for future research to guide both practitioners and researchers in advancing the use of foundation models in IoT applications.
Submitted 13 June, 2025;
originally announced June 2025.
-
Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR
Authors:
Longhao Li,
Yangze Li,
Hongfei Xue,
Jie Liu,
Shuai Fang,
Kai Wang,
Lei Xie
Abstract:
CTC-based streaming ASR has gained significant attention in real-world applications but faces two main challenges: accuracy degradation in small chunks and token emission latency. To mitigate these challenges, we propose Delayed-KD, which applies delayed knowledge distillation on CTC posterior probabilities from a non-streaming to a streaming model. Specifically, with a tiny chunk size, we introduce a Temporal Alignment Buffer (TAB) that defines a relative delay range compared to the non-streaming teacher model to align CTC outputs and mitigate non-blank token mismatches. Additionally, TAB enables fine-grained control over token emission delay. Experiments on 178-hour AISHELL-1 and 10,000-hour WenetSpeech Mandarin datasets show consistent superiority of Delayed-KD. Impressively, Delayed-KD at 40 ms latency achieves a lower character error rate (CER) of 5.42% on AISHELL-1, comparable to the competitive U2++ model running at 320 ms latency.
Submitted 28 May, 2025;
originally announced May 2025.
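The character error rate (CER) quoted in the Delayed-KD abstract above is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below illustrates that standard definition only; it is not the authors' evaluation code, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("abcd", "abxd")` counts one substitution over four reference characters.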
-
Learnable Burst-Encodable Time-of-Flight Imaging for High-Fidelity Long-Distance Depth Sensing
Authors:
Manchao Bao,
Shengjiang Fang,
Tao Yue,
Xuemei Hu
Abstract:
Long-distance depth imaging holds great promise for applications such as autonomous driving and robotics. Direct time-of-flight (dToF) imaging offers high-precision, long-distance depth sensing, yet demands ultra-short pulse light sources and high-resolution time-to-digital converters. In contrast, indirect time-of-flight (iToF) imaging often suffers from phase wrapping and low signal-to-noise ratio (SNR) as the sensing distance increases. In this paper, we introduce a novel ToF imaging paradigm, termed Burst-Encodable Time-of-Flight (BE-ToF), which facilitates high-fidelity, long-distance depth imaging. Specifically, the BE-ToF system emits light pulses in burst mode and estimates the phase delay of the reflected signal over the entire burst period, thereby effectively avoiding the phase wrapping inherent to conventional iToF systems. Moreover, to address the low SNR caused by light attenuation over increasing distances, we propose an end-to-end learnable framework that jointly optimizes the coding functions and the depth reconstruction network. A specialized double well function and first-order difference term are incorporated into the framework to ensure the hardware implementability of the coding functions. The proposed approach is rigorously validated through comprehensive simulations and real-world prototype experiments, demonstrating its effectiveness and practical applicability.
Submitted 28 May, 2025;
originally announced May 2025.
-
Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning
Authors:
Liang-Yeh Shen,
Shi-Xin Fang,
Yi-Cheng Lin,
Huang-Cheng Chou,
Hung-yi Lee
Abstract:
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
Submitted 22 May, 2025;
originally announced May 2025.
-
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Authors:
Chien-yu Huang,
Wei-Chih Chen,
Shu-wen Yang,
Andy T. Liu,
Chen-An Li,
Yu-Xiang Lin,
Wei-Cheng Tseng,
Anuj Diwan,
Yi-Jen Shih,
Jiatong Shi,
William Chen,
Chih-Kai Yang,
Wenze Ren,
Xuanjun Chen,
Chi-Yuan Hsiao,
Puyuan Peng,
Shih-Heng Wang,
Chun-Yi Kuan,
Ke-Han Lu,
Kai-Wei Chang,
Fabian Ritter-Gutierrez,
Kuan-Po Huang,
Siddhant Arora,
You-Kuan Lin,
Ming To Chuang
, et al. (55 additional authors not shown)
Abstract:
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.
Submitted 9 June, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
Fine-Tuning Hybrid Physics-Informed Neural Networks for Vehicle Dynamics Model Estimation
Authors:
Shiming Fang,
Kaiyan Yu
Abstract:
Accurate dynamic modeling is critical for autonomous racing vehicles, especially during high-speed and agile maneuvers where precise motion prediction is essential for safety. Traditional parameter estimation methods face limitations such as reliance on initial guesses, labor-intensive fitting procedures, and complex testing setups. On the other hand, purely data-driven machine learning methods struggle to capture inherent physical constraints and typically require large datasets for optimal performance. To address these challenges, this paper introduces the Fine-Tuning Hybrid Dynamics (FTHD) method, which integrates supervised and unsupervised Physics-Informed Neural Networks (PINNs), combining physics-based modeling with data-driven techniques. FTHD fine-tunes a pre-trained Deep Dynamics Model (DDM) using a smaller training dataset, delivering superior performance compared to state-of-the-art methods such as the Deep Pacejka Model (DPM) and outperforming the original DDM. Furthermore, an Extended Kalman Filter (EKF) is embedded within FTHD (EKF-FTHD) to effectively manage noisy real-world data, ensuring accurate denoising while preserving the vehicle's essential physical characteristics. The proposed FTHD framework is validated through scaled simulations using the BayesRace Physics-based Simulator and full-scale real-world experiments from the Indy Autonomous Challenge. Results demonstrate that the hybrid approach significantly improves parameter estimation accuracy, even with reduced data, and outperforms existing models. EKF-FTHD enhances robustness by denoising real-world data while maintaining physical insights, representing a notable advancement in vehicle dynamics modeling for high-speed autonomous racing.
Submitted 29 September, 2024;
originally announced September 2024.
-
USTC-KXDIGIT System Description for ASVspoof5 Challenge
Authors:
Yihao Chen,
Haochen Wu,
Nan Jiang,
Xiang Xia,
Qing Gu,
Yunqi Hao,
Pengfei Cai,
Yu Guan,
Jialong Wang,
Weilin Xie,
Lei Fang,
Sian Fang,
Yan Song,
Wu Guo,
Lin Liu,
Minqiang Xu
Abstract:
This paper describes the USTC-KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a front-end feature extractor and a back-end classifier. We focus on extensive embedding engineering and enhancing the generalization of the back-end classifier model. Specifically, the embedding engineering is based on hand-crafted features and speech representations from a self-supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the closed condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN-based ASV system. This approach achieved 0.2814 min-aDCF in the closed condition and 0.0756 min-aDCF in the open condition, showcasing superior performance in the SASV system.
Submitted 3 September, 2024;
originally announced September 2024.
-
CA-FedRC: Codebook Adaptation via Federated Reservoir Computing in 5G NR
Authors:
Ziqiang Ye,
Sikai Liao,
Yulan Gao,
Shu Fang,
Yue Xiao,
Ming Xiao,
Saviour Zammit
Abstract:
With the burgeoning deployment of the fifth-generation new radio (5G NR) networks, the codebook plays a crucial role in enabling the base station (BS) to acquire the channel state information (CSI). Different 5G NR codebooks incur varying overheads and exhibit performance disparities under diverse channel conditions, necessitating codebook adaptation based on channel conditions to reduce feedback overhead while enhancing performance. However, existing methods of 5G NR codebook adaptation require significant overhead for model training and feedback or fall short in performance. To address these limitations, this letter introduces a federated reservoir computing framework designed for efficient codebook adaptation in computationally and feedback resource-constrained mobile devices. This framework utilizes a novel series of indicators as input training data, striking an effective balance between performance and feedback overhead. Compared to conventional models, the proposed codebook adaptation via federated reservoir computing (CA-FedRC) achieves rapid convergence and significant loss reduction in both speed and accuracy. Extensive simulations under various channel conditions demonstrate that our algorithm not only reduces resource consumption of users but also accurately identifies channel types, thereby optimizing the trade-off between spectrum efficiency, computational complexity, and feedback overhead.
Submitted 8 July, 2024;
originally announced July 2024.
-
Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics
Authors:
Syu-Siang Wang,
Jia-Yang Chen,
Bo-Ren Bai,
Shih-Hau Fang,
Yu Tsao
Abstract:
The utilization of face masks is an essential healthcare measure, particularly during times of pandemics, yet it can present challenges in communication in our daily lives. To address this problem, we propose a novel approach known as the human-in-the-loop StarGAN (HL-StarGAN) face-masked speech enhancement method. HL-StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a "human-in-the-loop" module during the learning process of HL-StarGAN. The overall HL-StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on the reconstruction of the original clean speech and the optimization of human perception. To implement HL-StarGAN, we curated a face-masked speech database named "FMVD," which comprises recordings from 34 speakers in three distinct face-masked scenarios and a clean condition. We conducted subjective and objective tests on the proposed HL-StarGAN using this database. The test results are as follows: (1) MaskQSS successfully predicted the quality scores of face mask voices, outperforming several existing speech assessment methods. (2) The integration of the MaskQSS predictor enhanced the ability of HL-StarGAN to transform face mask voices into high-quality speech; this enhancement is evident in both objective and subjective tests, outperforming conventional StarGAN and CycleGAN-based systems.
Submitted 20 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
GDTM: An Indoor Geospatial Tracking Dataset with Distributed Multimodal Sensors
Authors:
Ho Lyun Jeong,
Ziqi Wang,
Colin Samplawski,
Jason Wu,
Shiwei Fang,
Lance M. Kaplan,
Deepak Ganesan,
Benjamin Marlin,
Mani Srivastava
Abstract:
Constantly locating moving objects, i.e., geospatial tracking, is essential for autonomous building infrastructure. Accurate and robust geospatial tracking often leverages multimodal sensor fusion algorithms, which require large datasets with time-aligned, synchronized data from various sensor types. However, such datasets are not readily available. Hence, we propose GDTM, a nine-hour dataset for multimodal object tracking with distributed multimodal sensors and reconfigurable sensor node placements. Our dataset enables the exploration of several research problems, such as optimizing architectures for processing multimodal data, and investigating models' robustness to adverse sensing conditions and sensor placement variances. A GitHub repository containing the code, sample data, and checkpoints of this work is available at https://github.com/nesl/GDTM.
Submitted 21 February, 2024;
originally announced February 2024.
-
Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks
Authors:
Sizhou Chen,
Songyang Gao,
Sen Fang
Abstract:
The Transformer architecture has proven to be highly effective for Automatic Speech Recognition (ASR) tasks, becoming a foundational component for a plethora of research in the domain. Historically, many approaches have leaned on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. This module offers the flexibility to extract speech features across various granularities, spanning from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our evaluation leverages a parallel attention architecture complemented by a dynamic gating mechanism that amalgamates traditional attention with the Echo-MSA module output. Empirical evidence from our study reveals that integrating Echo-MSA into the primary model's training regime significantly improves the word error rate (WER), all while preserving the intrinsic stability of the original model.
Submitted 7 April, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Open Set Synthetic Image Source Attribution
Authors:
Shengbang Fang,
Tai D. Nguyen,
Matthew C. Stamm
Abstract:
AI-generated images have become increasingly realistic and have garnered significant public attention. While synthetic images are intriguing due to their realism, they also pose an important misinformation threat. To address this new threat, researchers have developed multiple algorithms to detect synthetic images and identify their source generators. However, most existing source attribution techniques are designed to operate in a closed-set scenario, i.e. they can only be used to discriminate between known image generators. By contrast, new image-generation techniques are rapidly emerging. To contend with this, there is a great need for open-set source attribution techniques that can identify when synthetic images have originated from new, unseen generators. To address this problem, we propose a new metric learning-based approach. Our technique works by learning transferrable embeddings capable of discriminating between generators, even when they are not seen during training. An image is first assigned to a candidate generator, then is accepted or rejected based on its distance in the embedding space from known generators' learned reference points. Importantly, we identify that initializing our source attribution embedding network by pretraining it on image camera identification can improve our embeddings' transferability. Through a series of experiments, we demonstrate our approach's ability to attribute the source of synthetic images in open-set scenarios.
Submitted 22 August, 2023;
originally announced August 2023.
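The accept-or-reject step described in the abstract above (assign an image to a candidate generator, then accept or reject it based on distance to learned reference points) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: Euclidean distance, a single global threshold, and the names `attribute_source`, `reference_points`, and `threshold` are all hypothetical choices made for the example.

```python
import math


def attribute_source(embedding, reference_points, threshold):
    """Return the nearest known generator, or "unknown" if it is too far away.

    embedding:        sequence of floats for one image (assumed precomputed)
    reference_points: dict mapping generator name -> reference embedding
    threshold:        assumed maximum distance for accepting the candidate
    """
    best_name, best_dist = None, math.inf
    # Step 1: pick the candidate generator with the closest reference point.
    for name, ref in reference_points.items():
        dist = math.dist(embedding, ref)  # Euclidean distance (an assumption)
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Step 2: accept the candidate only if the embedding is close enough.
    if best_name is not None and best_dist <= threshold:
        return best_name
    return "unknown"
```

An embedding that lies far from every reference point is rejected as coming from an unseen generator, which is the open-set behavior the abstract describes.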
-
UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
Authors:
Sen Fang,
Bowen Gao,
Yangjian Wu,
Teik Toe Teoh
Abstract:
Multimodal large models have been recognized for their advantages in various performance and downstream tasks. The development of these models is crucial towards achieving general artificial intelligence in the future. In this paper, we propose a novel universal language representation learning method called UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal BriVL embeds audio, image, and text into a shared space, enabling the realization of various multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and image. Additionally, we demonstrate the qualitative evaluation of the generated images from UniBriVL, which serves to highlight the potential of our approach in creating images from audio. Overall, our experimental results demonstrate the efficacy of UniBriVL in downstream tasks and its ability to choose appropriate images from audio. The proposed approach has the potential for various applications such as speech recognition, music signal processing, and captioning systems.
Submitted 9 September, 2023; v1 submitted 29 July, 2023;
originally announced July 2023.
-
Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge
Authors:
Xiaoxiao Li,
Gaosheng Zhang,
An Zhu,
Weiyong Li,
Shuming Fang,
Xiaoyue Yang,
Jianchao Zhu
Abstract:
This paper presents a speech recognition system developed by the Transsion Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge. The system focuses on adapting ASR models for low-resource Indian languages and covers all four tracks of the challenge. For tracks 1 and 2, the acoustic model utilized a squeezeformer encoder and bidirectional transformer decoder with joint CTC-Attention training loss. Additionally, an external KenLM language model was used during TLG beam search decoding. For tracks 3 and 4, pretrained IndicWhisper models were employed and finetuned on both the challenge dataset and publicly available datasets. The whisper beam search decoding was also modified to support an external KenLM language model, which enabled better utilization of the additional text provided by the challenge. The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali language in the four tracks, and WER of 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri language in the four tracks. These results demonstrate the effectiveness of the proposed method.
Submitted 19 July, 2023;
originally announced July 2023.
-
Exploring Efficient-Tuned Learning Audio Representation Method from BriVL
Authors:
Sen Fang,
Yangjian Wu,
Bowen Gao,
Jingwen Cai,
Teik Toe Teoh
Abstract:
Recently, researchers have gradually realized that, in some cases, self-supervised pre-training on large-scale Internet data is better than pre-training on high-quality, manually labeled datasets, and that multimodal/large models are better than single or bimodal/small models. In this paper, we propose a robust audio representation learning method, WavBriVL, based on Bridging-Vision-and-Language (BriVL). WavBriVL projects audio, image, and text into a shared embedding space, so that multimodal applications can be realized. We present a qualitative evaluation of images generated from WavBriVL's shared embedding space. The main purposes of this paper are (1) to learn the correlation between audio and image and (2) to explore a new way of image generation, namely using audio to generate pictures. Experimental results show that this method can effectively generate appropriate images from audio.
Submitted 28 July, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
CarFi: Rider Localization Using Wi-Fi CSI
Authors:
Sirajum Munir,
Hongkai Chen,
Shiwei Fang,
Mahathir Monjur,
Shan Lin,
Shahriar Nirjon
Abstract:
With the rise of hailing services, people are increasingly relying on shared mobility (e.g., Uber, Lyft) drivers to pick them up for transportation. However, such drivers and riders have difficulties finding each other in urban areas as GPS signals get blocked by skyscrapers, in crowded environments (e.g., in stadiums, airports, and bars), at night, and in bad weather. This wastes their time, creates a bad user experience, and causes more CO2 emissions due to idle driving. In this work, we explore the potential of Wi-Fi to help drivers determine the street side of the riders. Our proposed system, CarFi, uses Wi-Fi CSI from two antennas placed inside a moving vehicle and a data-driven technique to determine the street side of the rider. Using real-world data collected in realistic and challenging settings, in which other people and other parked cars block the signal, we find that CarFi is 95.44% accurate in rider-side determination in both line of sight (LoS) and non-line of sight (nLoS) conditions, and can be run on an embedded GPU in real time.
Submitted 21 December, 2022;
originally announced January 2023.
-
Attacking Image Splicing Detection and Localization Algorithms Using Synthetic Traces
Authors:
Shengbang Fang,
Matthew C Stamm
Abstract:
Recent advances in deep learning have enabled forensics researchers to develop a new class of image splicing detection and localization algorithms. These algorithms identify spliced content by detecting localized inconsistencies in forensic traces using Siamese neural networks, either explicitly during analysis or implicitly during training. At the same time, deep learning has enabled new forms of anti-forensic attacks, such as adversarial examples and generative adversarial network (GAN) based attacks. Thus far, however, no anti-forensic attack has been demonstrated against image splicing detection and localization algorithms. In this paper, we propose a new GAN-based anti-forensic attack that is able to fool state-of-the-art splicing detection and localization algorithms such as EXIF-Net, Noiseprint, and Forensic Similarity Graphs. This attack operates by adversarially training an anti-forensic generator against a set of Siamese neural networks so that it is able to create synthetic forensic traces. Under analysis, these synthetic traces appear authentic and are self-consistent throughout an image. Through a series of experiments, we demonstrate that our attack is capable of fooling forensic splicing detection and localization algorithms without introducing visually detectable artifacts into an attacked image. Additionally, we demonstrate that our attack outperforms existing alternative attack approaches.
△ Less
Submitted 22 November, 2022;
originally announced November 2022.
-
Encoding feature supervised UNet++: Redesigning Supervision for liver and tumor segmentation
Authors:
Jiahao Cui,
Ruoxin Xiao,
Shiyuan Fang,
Minnan Pei,
Yixuan Yu
Abstract:
Liver tumor segmentation in CT images is a critical step in the diagnosis, surgical planning and postoperative evaluation of liver disease. An automatic liver and tumor segmentation method can greatly relieve physicians of the heavy workload of examining CT images and improve the accuracy of diagnosis. In the last few decades, many modifications of the U-Net model have been proposed in the literature. However, there are relatively few improvements for the more advanced UNet++ model. In this paper, we propose an encoding feature supervised UNet++ (ES-UNet++) and apply it to liver and tumor segmentation. ES-UNet++ consists of an encoding UNet++ and a segmentation UNet++. The well-trained encoding UNet++ extracts the encoding features of the label map, which are used to additionally supervise the segmentation UNet++. By adding supervision to each encoder of the segmentation UNet++, the U-Nets of different depths that constitute UNet++ outperform the original version by an average of 5.7% in dice score, and the overall dice score is thus improved by 2.1%. ES-UNet++ is evaluated on the LiTS dataset, achieving dice scores of 95.6% for liver segmentation and 67.4% for tumor segmentation. We also identify some valuable properties of ES-UNet++ through a comparative analysis between ES-UNet++ and UNet++: (1) encoding feature supervision can accelerate the convergence of the model; (2) encoding feature supervision enhances the effect of model pruning by achieving a large speedup while providing pruned models with fairly good performance.
Submitted 15 November, 2022;
originally announced November 2022.
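For readers unfamiliar with the dice score reported in the abstract above, the following is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), for two binary masks. The function name `dice_score` and the flat 0/1 list representation are illustrative assumptions, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient of two binary masks given as flat sequences of 0/1.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Number of positions where both masks are 1.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.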
-
ERASE-Net: Efficient Segmentation Networks for Automotive Radar Signals
Authors:
Shihong Fang,
Haoran Zhu,
Devansh Bisla,
Anna Choromanska,
Satish Ravindran,
Dongyin Ren,
Ryan Wu
Abstract:
Among various sensors for assisted and autonomous driving systems, automotive radar has been considered as a robust and low-cost solution even in adverse weather or lighting conditions. With the recent development of radar technologies and open-sourced annotated data sets, semantic segmentation with radar signals has become very promising. However, existing methods are either computationally expensive or discard significant amounts of valuable information from raw 3D radar signals by reducing them to 2D planes via averaging. In this work, we introduce ERASE-Net, an Efficient RAdar SEgmentation Network to segment the raw radar signals semantically. The core of our approach is the novel detect-then-segment method for raw radar signals. It first detects the center point of each object, then extracts a compact radar signal representation, and finally performs semantic segmentation. We show that our method can achieve superior performance on radar semantic segmentation task compared to the state-of-the-art (SOTA) technique. Furthermore, our approach requires up to 20x less computational resources. Finally, we show that the proposed ERASE-Net can be compressed by 40% without significant loss in performance, significantly more than the SOTA network, which makes it a more promising candidate for practical automotive applications.
Submitted 24 February, 2023; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Intelligent Omni Surface-Assisted Self-Interference Cancellation for Full-Duplex MISO System
Authors:
Sisai Fang,
Gaojie Chen,
Pei Xiao,
Kai-Kit Wong,
Rahim Tafazolli
Abstract:
Full-duplex (FD) communication can achieve higher spectrum efficiency than conventional half-duplex (HD) communication; however, self-interference (SI) is the key hurdle. This paper is the first work to propose intelligent Omni surface (IOS)-assisted FD multi-input single-output (MISO) communication systems to mitigate SI, which solves the frequency-selectivity issue. In particular, two types of IOS are proposed: energy splitting (ES)-IOS and mode switching (MS)-IOS. We aim to maximize the data rate and minimize the SI power by optimizing the beamforming vectors, amplitudes and phase shifts for the ES-IOS, and the mode selection and phase shifts for the MS-IOS. However, the formulated problems are non-convex and challenging to tackle directly. Thus, we design alternating optimization algorithms to solve the problems iteratively. Specifically, quadratically constrained quadratic programming (QCQP) is employed for the beamforming optimization, the amplitude and phase shift optimization for the ES-IOS, and the phase shift optimization for the MS-IOS. The binary variables of the MS-IOS, however, render the mode selection optimization intractable, so we resort to semidefinite relaxation (SDR) and a Gaussian randomization procedure to solve it. Simulation results validate the proposed algorithms' efficacy and show the effectiveness of both IOSs in mitigating SI compared to the case without an IOS.
Submitted 12 August, 2022;
originally announced August 2022.
-
Continuous Speech for Improved Learning Pathological Voice Disorders
Authors:
Syu-Siang Wang,
Chi-Te Wang,
Chih-Chung Lai,
Yu Tsao,
Shih-Hau Fang
Abstract:
Goal: Numerous studies have successfully differentiated normal and abnormal voice samples. Nevertheless, further classification has rarely been attempted. This study proposes a novel approach, using continuous Mandarin speech instead of a single vowel, to classify four common voice disorders (i.e. functional dysphonia, neoplasm, phonotrauma, and vocal palsy). Methods: In the proposed framework, acoustic signals are transformed into mel-frequency cepstral coefficients, and a bi-directional long-short term memory network (BiLSTM) is adopted to model the sequential features. The experiments were conducted on a large-scale database, wherein 1,045 continuous speech samples were collected by the speech clinic of a hospital from 2012 to 2019. Results: Experimental results demonstrated that the proposed framework yields significant accuracy and unweighted average recall improvements of 78.12-89.27% and 50.92-80.68%, respectively, compared with systems that use a single vowel. Conclusions: The results are consistent with other machine learning algorithms, including gated recurrent units, random forest, deep neural networks, and LSTM. The sensitivities for each disorder were also analyzed, and the model capabilities were visualized via principal component analysis. An alternative experiment based on a balanced dataset again confirms the advantages of using continuous speech for learning voice disorders.
Submitted 22 February, 2022;
originally announced February 2022.
-
Keeping Deep Lithography Simulators Updated: Global-Local Shape-Based Novelty Detection and Active Learning
Authors:
Hao-Chiang Shao,
Hsing-Lei Ping,
Kuo-shiuan Chen,
Weng-Tai Su,
Chia-Wen Lin,
Shao-Yun Fang,
Pin-Yian Tsai,
Yan-Hsiu Liu
Abstract:
Learning-based pre-simulation (i.e., layout-to-fabrication) models have been proposed to predict the fabrication-induced shape deformation from an IC layout to its fabricated circuit. Such models are usually driven by pairwise learning, involving a training set of layout patterns and their reference shape images after fabrication. However, it is expensive and time-consuming to collect the reference shape images of all layout clips for model training and updating. To address the problem, we propose a deep learning-based layout novelty detection scheme to identify novel (unseen) layout patterns, which cannot be well predicted by a pre-trained pre-simulation model. We devise a global-local novelty scoring mechanism to assess the potential novelty of a layout by exploiting two subnetworks: an autoencoder and a pretrained pre-simulation model. The former characterizes the global structural dissimilarity between a given layout and training samples, whereas the latter extracts a latent code representing the fabrication-induced local deformation. By integrating the global dissimilarity with the local deformation boosted by a self-attention mechanism, our model can accurately detect novelties without the ground-truth circuit shapes of test samples. Based on the detected novelties, we further propose two active-learning strategies to sample a reduced amount of representative layouts most worthy to be fabricated for acquiring their ground-truth circuit shapes. Experimental results demonstrate i) our method's effectiveness in layout novelty detection, and ii) our active-learning strategies' ability in selecting representative novel layouts for keeping a learning-based pre-simulation model updated.
Submitted 24 January, 2022;
originally announced January 2022.
-
Toward Real-World Voice Disorder Classification
Authors:
Heng-Cheng Kuo,
Yu-Peng Hsieh,
Huan-Hsin Tseng,
Chi-Te Wang,
Shih-Hau Fang,
Yu Tsao
Abstract:
Objective: Voice disorders significantly compromise individuals' ability to speak in their daily lives. Without early diagnosis and treatment, these disorders may deteriorate drastically. Thus, automatic classification systems for home use are desirable for people who lack access to clinical disease assessments. However, the performance of such systems may be weakened by constrained resources and by the domain mismatch between clinical data and noisy real-world data. Methods: This study develops a compact and domain-robust voice disorder classification system to identify the utterances of health, neoplasm, and benign structural diseases. Our proposed system utilizes a feature extractor model composed of factorized convolutional neural networks and subsequently deploys domain adversarial training to reconcile the domain mismatch by extracting domain-invariant features. Results: The results show that the unweighted average recall in the noisy real-world domain improved by 13% and remained at 80% in the clinic domain with only slight degradation. The domain mismatch was effectively eliminated. Moreover, the proposed system reduced the usage of both memory and computation by over 73.9%. Conclusion: By deploying factorized convolutional neural networks and domain adversarial training, domain-invariant features can be derived for voice disorder classification with limited resources. The promising results confirm that the proposed system can significantly reduce resource consumption and improve classification accuracy by considering the domain mismatch. Significance: To the best of our knowledge, this is the first study that jointly considers real-world model compression and noise-robustness issues in voice disorder classification. The proposed system is intended for application to embedded systems with limited resources.
Submitted 26 April, 2023; v1 submitted 5 December, 2021;
originally announced December 2021.
-
Feedback Capacity of Parallel ACGN Channels and Kalman Filter: Power Allocation with Feedback
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we relate the feedback capacity of parallel additive colored Gaussian noise (ACGN) channels to a variant of the Kalman filter. By doing so, we obtain lower bounds on the feedback capacity of such channels, as well as the corresponding feedback (recursive) coding schemes, which are essentially power allocation policies with feedback, to achieve the bounds. The results are seen to reduce to existing lower bounds in the case of a single ACGN feedback channel, whereas when it comes to parallel additive white Gaussian noise (AWGN) channels with feedback, the recursive coding scheme reduces to a feedback "water-filling" power allocation policy.
Submitted 15 February, 2021; v1 submitted 4 February, 2021;
originally announced February 2021.
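The abstract above refers to a feedback "water-filling" power allocation policy. As background only, here is a minimal sketch of the classical (non-feedback) water-filling allocation over parallel Gaussian channels, which chooses a water level mu so that the powers max(mu - N_i, 0) sum to the power budget. It is not the paper's feedback scheme, and the function name `water_filling` and its arguments are illustrative assumptions.

```python
def water_filling(noise, total_power):
    """Classical water-filling over parallel Gaussian channels.

    noise: list of noise powers N_i (one per channel), assumed non-empty.
    total_power: non-negative power budget P.
    Returns the per-channel powers max(mu - N_i, 0), which sum to P.
    Illustrative sketch only; not the feedback policy from the paper.
    """
    if not noise or total_power < 0:
        raise ValueError("need at least one channel and a non-negative budget")
    ordered = sorted(noise)
    level = ordered[0]
    # Try using the k quietest channels, starting from all of them, and
    # keep the first k for which the water level covers the k-th channel.
    for k in range(len(ordered), 0, -1):
        level = (total_power + sum(ordered[:k])) / k
        if level > ordered[k - 1]:
            break
    return [max(level - n, 0.0) for n in noise]


if __name__ == "__main__":
    print(water_filling([1.0, 2.0, 5.0], 3.0))
```

In this sketch the noisiest channel receives no power when the budget is too small to raise the water level above its noise floor.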
-
Optimal Energy Scheduling and Sensitivity Analysis for Integrated Power-Water-Heat Systems
Authors:
Sidun Fang,
Chenxu Wang,
Yashen Lin,
Changhong Zhao
Abstract:
The conventionally independent power, water, and heating networks are becoming more tightly connected, which motivates their joint optimal energy scheduling to improve the overall efficiency of an integrated energy system. However, such a joint optimization is known as a challenging problem with complex network constraints and couplings of electric, hydraulic, and thermal models that are nonlinear and nonconvex. We formulate an optimal power-water-heat flow (OPWHF) problem and develop a computationally efficient heuristic to solve it. The proposed heuristic decomposes OPWHF into subproblems, which are iteratively solved via convex relaxation and convex-concave procedure. Simulation results validate that the proposed framework can improve operational flexibility and social welfare of the integrated system, wherein the water and heating networks respond as virtual energy storage to time-varying energy prices and solar photovoltaic generation. Moreover, we perform sensitivity analysis to compare two modes of heating network control: by flow rate and by temperature. Our results reveal that the latter is more effective for heating networks with a wider space of pipeline parameters.
Submitted 25 November, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
Relativistic Rocket Control (Relativistic Space-Travel Flight Control): Feedback Control of Relativistic Dynamics Propelled by Ejecting Mass
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this short note, we investigate the feedback control of relativistic dynamics propelled by mass ejection, modeling, e.g., the relativistic rocket control or the relativistic (space-travel) flight control. As an extreme case, we also examine the control of relativistic photon rockets which are propelled by ejecting photons.
Submitted 11 January, 2021; v1 submitted 29 December, 2020;
originally announced January 2021.
-
Fundamental Limits of Controlled Stochastic Dynamical Systems: An Information-Theoretic Approach
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we examine the fundamental performance limitations in the control of stochastic dynamical systems; more specifically, we derive generic $\mathcal{L}_p$ bounds that hold for any causal (stabilizing) controllers and any stochastic disturbances, by an information-theoretic analysis. We first consider the scenario where the plant (i.e., the dynamical system to be controlled) is linear time-invariant, and it is seen in general that the lower bounds are characterized by the unstable poles (or nonminimum-phase zeros) of the plant as well as the conditional entropy of the disturbance. We then analyze the setting where the plant is assumed to be (strictly) causal, for which case the lower bounds are determined by the conditional entropy of the disturbance. We also discuss the special cases of $p = 2$ and $p = \infty$, which correspond to minimum-variance control and controlling the maximum deviations, respectively. In addition, we investigate the power-spectral characterization of the lower bounds as well as its relation to the Kolmogorov-Szegö formula.
Submitted 3 June, 2021; v1 submitted 22 December, 2020;
originally announced December 2020.
-
Blind Monaural Source Separation on Heart and Lung Sounds Based on Periodic-Coded Deep Autoencoder
Authors:
Kun-Hsi Tsai,
Wei-Chien Wang,
Chui-Hsuan Cheng,
Chan-Yen Tsai,
Jou-Kou Wang,
Tzu-Hao Lin,
Shih-Hau Fang,
Li-Chin Chen,
Yu Tsao
Abstract:
Auscultation is the most efficient way to diagnose cardiovascular and respiratory diseases. To reach accurate diagnoses, a device must be able to recognize heart and lung sounds from various clinical situations. However, recorded chest sounds are a mixture of heart and lung sounds. Thus, effectively separating these two sounds is critical in the pre-processing stage. Recent advances in machine learning have improved monaural source separation, but most of the well-known techniques require paired mixed sounds and individual pure sounds for model training. As the preparation of pure heart and lung sounds is difficult, special designs must be considered to derive effective heart and lung sound separation techniques. In this study, we propose a novel periodicity-coded deep auto-encoder (PC-DAE) approach to separate mixed heart-lung sounds in an unsupervised manner, based on the assumption that heart rate and respiration rate have different periodicities. The PC-DAE benefits from deep-learning-based models by extracting representative features and considers the periodicity of heart and lung sounds to carry out the separation. We evaluated PC-DAE on two datasets. The first one includes sounds from the Student Auscultation Manikin (SAM), and the second is prepared by recording chest sounds in real-world conditions. Experimental results indicate that PC-DAE outperforms several well-known separation methods in terms of standardized evaluation metrics. Moreover, waveforms and spectrograms demonstrate the effectiveness of PC-DAE compared to existing approaches. It is also confirmed that by using the proposed PC-DAE as a pre-processing stage, heart sound recognition accuracies can be notably boosted. The experimental results confirm the effectiveness of PC-DAE and its potential to be used in clinical applications.
Submitted 11 December, 2020;
originally announced December 2020.
-
The Spectral-Domain $\mathcal{W}_2$ Wasserstein Distance for Elliptical Processes and the Spectral-Domain Gelbrich Bound
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this short note, we introduce the spectral-domain $\mathcal{W}_2$ Wasserstein distance for elliptical stochastic processes in terms of their power spectra. We also introduce the spectral-domain Gelbrich bound for processes that are not necessarily elliptical.
Submitted 6 January, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
Independent Elliptical Distributions Minimize Their $\mathcal{W}_2$ Wasserstein Distance from Independent Elliptical Distributions with the Same Density Generator
Authors:
Song Fang,
Quanyan Zhu
Abstract:
This short note is on a property of the $\mathcal{W}_2$ Wasserstein distance which indicates that independent elliptical distributions minimize their $\mathcal{W}_2$ Wasserstein distance from given independent elliptical distributions with the same density generators. Furthermore, we examine the implications of this property in the Gelbrich bound when the distributions are not necessarily elliptical. Meanwhile, we also generalize the results to the cases when the distributions are not independent. The primary purpose of this note is for the referencing of papers that need to make use of this property or its implications.
Submitted 7 December, 2020;
originally announced December 2020.
-
Fundamental Stealthiness-Distortion Tradeoffs in Dynamical Systems under Injection Attacks: A Power Spectral Analysis
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we analyze the fundamental stealthiness-distortion tradeoffs of linear Gaussian dynamical systems under data injection attacks using a power spectral analysis, where the Kullback-Leibler (KL) divergence is employed as the stealthiness measure. In particular, we obtain explicit formulas in terms of power spectra that analytically characterize the stealthiness-distortion tradeoffs as well as the properties of the worst-case attacks. Furthermore, it is seen in general that the attacker only needs to know the input-output behaviors of the systems in order to carry out the worst-case attacks.
Submitted 11 May, 2021; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Independent Gaussian Distributions Minimize the Kullback-Leibler (KL) Divergence from Independent Gaussian Distributions
Authors:
Song Fang,
Quanyan Zhu
Abstract:
This short note is on a property of the Kullback-Leibler (KL) divergence which indicates that independent Gaussian distributions minimize the KL divergence from given independent Gaussian distributions. The primary purpose of this note is for the referencing of papers that need to make use of this property entirely or partially.
Submitted 3 December, 2020; v1 submitted 4 November, 2020;
originally announced November 2020.
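As background for the quantity discussed in the abstract above, the following is a minimal sketch of the standard closed-form KL divergence between two Gaussian distributions with independent components, which is the sum of the per-component terms log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2. It only illustrates the divergence itself, not the note's minimization result; the function name `kl_independent_gaussians` and its arguments are illustrative assumptions.

```python
import math


def kl_independent_gaussians(means_p, stds_p, means_q, stds_q):
    """KL(P || Q) in nats for Gaussians P and Q with independent components.

    Each argument is a list with one entry per component; standard
    deviations are assumed strictly positive. Illustrative helper only.
    """
    if not (len(means_p) == len(stds_p) == len(means_q) == len(stds_q)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for m1, s1, m2, s2 in zip(means_p, stds_p, means_q, stds_q):
        # Per-component closed form; independence makes the terms add up.
        total += (
            math.log(s2 / s1)
            + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2)
            - 0.5
        )
    return total


if __name__ == "__main__":
    print(kl_independent_gaussians([0.0], [1.0], [1.0], [1.0]))
```

Because the components are independent, the divergence decomposes into a sum over components, so it can be evaluated one dimension at a time.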
-
Fundamental Limits of Obfuscation for Linear Gaussian Dynamical Systems: An Information-Theoretic Approach
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we study the fundamental limits of obfuscation in terms of privacy-distortion tradeoffs for linear Gaussian dynamical systems via an information-theoretic approach. Particularly, we obtain analytical formulas that capture the fundamental privacy-distortion tradeoffs when privacy masks are to be added to the outputs of the dynamical systems, while indicating explicitly how to design the privacy masks in an optimal way: The privacy masks should be colored Gaussian with power spectra shaped specifically based upon the system and noise properties.
Submitted 29 October, 2020;
originally announced November 2020.
-
Channel Leakage, Information-Theoretic Limitations of Obfuscation, and Optimal Privacy Mask Design for Streaming Data
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we first introduce the notion of channel leakage as the minimum mutual information between the channel input and channel output. As its name indicates, channel leakage quantifies the minimum information leakage to the malicious receiver. In a broad sense, it can be viewed as a dual concept of channel capacity, which characterizes the maximum information transmission to the targeted receiver. We obtain explicit formulas of channel leakage for the white Gaussian case, the colored Gaussian case, and the fading case. We then utilize this notion to investigate the fundamental limitations of obfuscation in terms of privacy-distortion tradeoffs (as well as privacy-power tradeoffs) for streaming data; particularly, we derive analytical tradeoff equations for the stationary case, the non-stationary case, and the finite-time case. Our results also indicate explicitly how to design the privacy masks in an optimal way.
Submitted 29 September, 2020; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Modeling, Analysis, and Optimization of Grant-Free NOMA in Massive MTC via Stochastic Geometry
Authors:
Jiaqi Liu,
Gang Wu,
Xiaoxu Zhang,
Shu Fang,
Shaoqian Li
Abstract:
Massive machine-type communications (mMTC) is a crucial scenario to support booming Internet of Things (IoT) applications. In mMTC, although a large number of devices are registered to an access point (AP), very few of them are active with uplink short packet transmission at the same time, which requires novel design of protocols and receivers to enable efficient data transmission and accurate multi-user detection (MUD). To address this problem, the grant-free non-orthogonal multiple access (GF-NOMA) protocol has been proposed. In GF-NOMA, active devices can directly transmit their preambles and data symbols together within one time frame, without a grant from the AP. Compressive sensing (CS)-based receivers are adopted for non-orthogonal preambles (NOP)-based MUD, and successive interference cancellation is exploited to decode the superimposed data signals. In this paper, we model, analyze, and optimize the CS-based GF-NOMA mMTC system via stochastic geometry (SG), from the perspective of network deployment. Based on the SG network model, we first analyze the success probability as well as the channel estimation error of the CS-based MUD in the preamble phase and then analyze the average aggregate data rate in the data phase. As IoT applications demand low energy consumption, low infrastructure cost, and flexible deployment, we optimize the energy efficiency and AP coverage efficiency of GF-NOMA via numerical methods. The validity of our analysis is verified via Monte Carlo simulations. Simulation results also show that CS-based GF-NOMA with NOP yields better MUD and data rate performance than contention-based GF-NOMA with orthogonal preambles and CS-based grant-free orthogonal multiple access.
Submitted 5 April, 2020;
originally announced April 2020.
-
From IC Layout to Die Photo: A CNN-Based Data-Driven Approach
Authors:
Hao-Chiang Shao,
Chao-Yi Peng,
Jun-Rei Wu,
Chia-Wen Lin,
Shao-Yun Fang,
Pin-Yen Tsai,
Yan-Hsiu Liu
Abstract:
We propose a deep learning-based data-driven framework consisting of two convolutional neural networks: i) LithoNet, which predicts the shape deformations on a circuit due to IC fabrication, and ii) OPCNet, which suggests IC layout corrections to compensate for such shape deformations. By learning the shape correspondences between pairs of layout design patterns and the scanning electron microscope (SEM) images of the corresponding product wafers, LithoNet can, given an IC layout pattern, mimic the fabrication process to predict its fabricated circuit shape. Furthermore, LithoNet can take the wafer fabrication parameters as a latent vector to model the parametric product variations that can be inspected on SEM images. In addition, traditional optical proximity correction (OPC) methods used to suggest a correction on a lithographic photomask are computationally expensive. Our proposed OPCNet mimics the OPC procedure and efficiently generates a corrected photomask by collaborating with LithoNet to examine whether the shape of a fabricated circuit optimally matches its original layout design. As a result, the proposed LithoNet-OPCNet framework can not only predict the shape of a fabricated IC from its layout pattern, but also suggest a layout correction according to the consistency between the predicted shape and the given layout. Experimental results with several benchmark layout patterns demonstrate the effectiveness of the proposed method.
Submitted 6 August, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
Feedback Capacity and a Variant of the Kalman Filter with ARMA Gaussian Noises: Explicit Bounds and Feedback Coding Design
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we relate a feedback channel with any finite-order autoregressive moving-average (ARMA) Gaussian noises to a variant of the Kalman filter. In light of this, we obtain relatively explicit lower bounds on the feedback capacity for such colored Gaussian noises, and the bounds are seen to be consistent with various existing results in the literature. Meanwhile, this variant of the Kalman filter also leads to explicit recursive coding schemes with clear structures to achieve the lower bounds. In general, our results provide an alternative perspective while pointing to potentially tighter bounds for the feedback capacity problem.
Submitted 3 June, 2021; v1 submitted 9 January, 2020;
originally announced January 2020.
-
Information-Theoretic Performance Limitations of Feedback Control: Underlying Entropic Laws and Generic $\mathcal{L}_{p}$ Bounds
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we utilize information theory to study the fundamental performance limitations of generic feedback systems, where both the controller and the plant may be any causal functions/mappings and the disturbance may follow any distribution. More specifically, we obtain fundamental $\mathcal{L}_p$ bounds on the control error, which are shown to be completely characterized by the conditional entropy of the disturbance, based upon the entropic laws that are inherent in any feedback system. We also discuss the generality and implications (e.g., for the fundamental limits of learning-based control) of the obtained bounds.
Submitted 6 May, 2021; v1 submitted 11 December, 2019;
originally announced December 2019.
-
Relativistic Control: Feedback Control of Relativistic Dynamics
Authors:
Song Fang,
Quanyan Zhu
Abstract:
Strictly speaking, Newton's second law of motion is only an approximation of the so-called relativistic dynamics, i.e., Einstein's modification of the second law based on his theory of special relativity. Although the approximation is almost exact when the velocity of the dynamical system is far less than the speed of light, the difference becomes larger and larger (and eventually goes to infinity) as the velocity approaches the speed of light. Correspondingly, feedback control of such dynamics should also take this modification into consideration (though it renders the system nonlinear), especially when the velocity is relatively large. Towards this end, we start this note by studying the state-space representation of the relativistic dynamics. We then investigate how to employ the feedback linearization approach for such relativistic dynamics, based upon which an additional linear controller may then be designed. As such, the feedback linearization together with the linear controller constitutes the overall relativistic feedback control law. We also provide discussions on, e.g., controllability, state feedback and output feedback, as well as PID control, in the relativistic setting.
Submitted 13 January, 2021; v1 submitted 6 December, 2019;
originally announced December 2019.
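The feedback-linearization idea in the abstract above can be illustrated with a minimal sketch. It assumes one-dimensional motion of a point mass, for which special relativity gives dv/dt = F (1 - v^2/c^2)^(3/2) / m. The function names, the proportional outer-loop gain, and the forward-Euler integration are illustrative choices, not the authors' formulation.

```python
C = 299_792_458.0  # speed of light in m/s


def relativistic_accel(force, v, mass):
    """1-D relativistic dynamics: dv/dt = F * (1 - v^2/c^2)^(3/2) / m."""
    return force * (1.0 - (v / C) ** 2) ** 1.5 / mass


def linearizing_force(u, v, mass):
    """Pick F so that the closed loop obeys dv/dt = u (feedback linearization)."""
    return mass * u / (1.0 - (v / C) ** 2) ** 1.5


def simulate(v_ref, mass=1.0, gain=2.0, dt=1e-3, steps=10_000):
    """Track a velocity reference with a proportional outer loop (assumed design)."""
    v = 0.0
    for _ in range(steps):
        u = -gain * (v - v_ref)                # linear controller on the linearized plant
        force = linearizing_force(u, v, mass)  # cancel the relativistic nonlinearity
        v += dt * relativistic_accel(force, v, mass)  # forward-Euler step
    return v


v_final = simulate(v_ref=0.9 * C)
print(v_final / C)
```

In this sketch the required force grows without bound as v approaches c, which is one way to see why the nonlinearity cannot be ignored at large velocities.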
-
Fundamental Limitations in Sequential Prediction and Recursive Algorithms: $\mathcal{L}_{p}$ Bounds via an Entropic Analysis
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we obtain fundamental $\mathcal{L}_{p}$ bounds in sequential prediction and recursive algorithms via an entropic analysis. Both classes of problems are examined by investigating the underlying entropic relationships of the data and/or noises involved, and the derived lower bounds may all be quantified in a conditional entropy characterization. We also study the conditions to achieve the generic bounds from an innovations' viewpoint.
Submitted 11 May, 2021; v1 submitted 3 December, 2019;
originally announced December 2019.
-
Distributed Microphone Speech Enhancement based on Deep Learning
Authors:
Syu-Siang Wang,
Yu-You Liang,
Jeih-weih Hung,
Yu Tsao,
Hsin-Min Wang,
Shih-Hau Fang
Abstract:
Speech-related applications deliver inferior performance in complex noise environments. This study addresses the problem by introducing speech-enhancement (SE) systems based on deep neural networks (DNNs) applied to a distributed microphone architecture, and investigates the effectiveness of three different DNN-model structures. The first system constructs a DNN model for each microphone to enhance the recorded noisy speech signal, and the second system combines all the noisy recordings into a large feature structure that is then enhanced through a DNN model. In the third system, a channel-dependent DNN is first used to enhance the corresponding noisy input, and all the channel-wise enhanced outputs are fed into a DNN fusion model to construct a nearly clean signal. All three DNN SE systems operate in the acoustic frequency domain of speech signals in a diffuse-noise field environment. Evaluation experiments were conducted on the Taiwan Mandarin Hearing in Noise Test (TMHINT) database, and the results indicate that all three DNN-based SE systems improve the speech quality and intelligibility of the original noise-corrupted signals, with the third system delivering the highest signal-to-noise ratio (SNR) improvement and the best speech intelligibility.
Submitted 24 May, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
Generic Bounds on the Maximum Deviations in Sequential Prediction: An Information-Theoretic Analysis
Authors:
Song Fang,
Quanyan Zhu
Abstract:
In this paper, we derive generic bounds on the maximum deviations in prediction errors for sequential prediction via an information-theoretic approach. The fundamental bounds are shown to depend only on the conditional entropy of the data point to be predicted given the previous data points. In the asymptotic case, the bounds are achieved if and only if the prediction error is white and uniformly distributed.
Submitted 11 May, 2021; v1 submitted 11 October, 2019;
originally announced October 2019.
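One standard fact that gives intuition for the abstract above: among all densities supported on [-D, D], the uniform density maximizes differential entropy, so any error e with |e| <= D satisfies h(e) <= ln(2D), i.e. D >= exp(h(e)) / 2 (entropy in nats). The sketch below checks this numerically for a uniform and a triangular density. It uses the unconditional, scalar case only, and the paper's precise statement may differ. The helper names are hypothetical.

```python
import numpy as np


def differential_entropy(pdf, lo, hi, n=200_001):
    """Trapezoidal estimate of h = -integral of p(x) ln p(x) dx over [lo, hi], in nats."""
    x = np.linspace(lo, hi, n)
    p = pdf(x)
    safe = np.where(p > 0, p, 1.0)  # avoid log(0); those points contribute zero
    integrand = -p * np.log(safe)
    dx = x[1] - x[0]
    return float(np.sum((integrand[:-1] + integrand[1:]) * 0.5 * dx))


def deviation_lower_bound(entropy_nats):
    """Max-entropy bound on the maximum deviation: D >= exp(h) / 2."""
    return float(np.exp(entropy_nats) / 2.0)


D = 1.0  # both example errors are bounded by |e| <= D
uniform = lambda x: np.full_like(x, 1.0 / (2.0 * D))
triangular = lambda x: (1.0 - np.abs(x) / D) / D

bound_uniform = deviation_lower_bound(differential_entropy(uniform, -D, D))
bound_triangular = deviation_lower_bound(differential_entropy(triangular, -D, D))
print(bound_uniform, bound_triangular)
```

The uniform error meets the bound with equality, while the triangular error leaves a gap, consistent with the abstract's remark that the bounds are achieved when the error is uniformly distributed.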
-
Two-Way Coding and Attack Decoupling in Control Systems Under Injection Attacks
Authors:
Song Fang,
Karl Henrik Johansson,
Mikael Skoglund,
Henrik Sandberg,
Hideaki Ishii
Abstract:
In this paper, we introduce the concept of two-way coding, which originates in communication theory characterizing coding schemes for two-way channels, into control theory, particularly to facilitate the analysis and design of feedback control systems under injection attacks. Moreover, we propose the notion of attack decoupling, and show how the controller and the two-way coding can be co-designed to nullify the transfer function from attack to plant, rendering the attack effect zero both in transient phase and in steady state.
Submitted 4 September, 2019;
originally announced September 2019.
-
Generic Variance Bounds on Estimation and Prediction Errors in Time Series Analysis: An Entropy Perspective
Authors:
Song Fang,
Mikael Skoglund,
Karl Henrik Johansson,
Hideaki Ishii,
Quanyan Zhu
Abstract:
In this paper, we obtain generic bounds on the variances of estimation and prediction errors in time series analysis via an information-theoretic approach. It is seen in general that the error bounds are determined by the conditional entropy of the data point to be estimated or predicted given the side information or past observations. Additionally, we discover that in order to achieve the prediction error bounds asymptotically, the necessary and sufficient condition is that the "innovation" is asymptotically white Gaussian. When restricted to Gaussian processes and 1-step prediction, our bounds are shown to reduce to the Kolmogorov-Szegö formula and Wiener-Masani formula known from linear prediction theory.
Submitted 11 May, 2021; v1 submitted 9 April, 2019;
originally announced April 2019.
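The Kolmogorov-Szegö formula mentioned in the abstract above states that, for a stationary process with power spectral density S(ω), the minimum one-step linear prediction error variance is exp((1/2π) ∫ ln S(ω) dω) over [-π, π]. A minimal numerical check on an AR(1) process follows. It assumes the PSD convention S(ω) = σ² / |1 - a e^{-jω}|² (no extra 1/2π factor), and the function names are hypothetical.

```python
import numpy as np


def one_step_prediction_variance(psd, n_grid=4096):
    """Kolmogorov-Szego formula: exp of the mean of ln S(w) over one period."""
    omega = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    # The mean over a uniform grid approximates (1 / 2pi) * integral of ln S(w) dw.
    return float(np.exp(np.mean(np.log(psd(omega)))))


def ar1_psd(a, sigma2):
    """PSD of x[k] = a * x[k-1] + w[k], where w is white with variance sigma2."""
    return lambda w: sigma2 / np.abs(1.0 - a * np.exp(-1j * w)) ** 2


var = one_step_prediction_variance(ar1_psd(a=0.8, sigma2=2.0))
print(var)
```

For a stable AR(1) process the best one-step predictor is a * x[k-1], so the prediction error is the driving noise itself and the formula should return its variance, here 2.0.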
-
Two-Way Coding in Control Systems Under Injection Attacks: From Attack Detection to Attack Correction
Authors:
Song Fang,
Karl Henrik Johansson,
Mikael Skoglund,
Henrik Sandberg,
Hideaki Ishii
Abstract:
In this paper, we introduce the method of two-way coding, a concept originating in communication theory characterizing coding schemes for two-way channels, into (networked) feedback control systems under injection attacks. We first show that the presence of two-way coding can distort the perspective of the attacker on the control system. In general, the distorted viewpoint on the attacker side as a consequence of two-way coding will facilitate detecting the attacks, or restricting what the attacker can do, or even correcting the attack effect. In the particular case of zero-dynamics attacks, if the attacks are to be designed according to the original plant, then they will be easily detected; while if the attacks are designed with respect to the equivalent plant as viewed by the attacker, then under the additional assumption that the plant is stabilizable by static output feedback, the attack effect may be corrected in steady state.
Submitted 17 January, 2019; v1 submitted 16 January, 2019;
originally announced January 2019.
-
Robustness against the channel effect in pathological voice detection
Authors:
Yi-Te Hsu,
Zining Zhu,
Chi-Te Wang,
Shih-Hau Fang,
Frank Rudzicz,
Yu Tsao
Abstract:
Many people are suffering from voice disorders, which can adversely affect the quality of their lives. In response, some researchers have proposed algorithms for automatic assessment of these disorders, based on voice signals. However, these signals can be sensitive to the recording devices. Indeed, the channel effect is a pervasive problem in machine learning for healthcare. In this study, we propose a detection system for pathological voice, which is robust against the channel effect. This system is based on a bidirectional LSTM network. To increase the performance robustness against channel mismatch, we integrate domain adversarial training (DAT) to eliminate the differences between the devices. When we train on data recorded on a high-quality microphone and evaluate on smartphone data without labels, our robust detection system increases the PR-AUC from 0.8448 to 0.9455 (and 0.9522 with target sample labels). To the best of our knowledge, this is the first study applying unsupervised domain adaptation to pathological voice detection. Notably, our system does not need target device sample labels, which allows for generalization to many new devices.
Submitted 2 December, 2018; v1 submitted 26 November, 2018;
originally announced November 2018.
-
A Frequency-Domain Characterization of Optimal Error Covariance for the Kalman-Bucy Filter
Authors:
Song Fang,
Hideaki Ishii,
Jie Chen,
Karl Henrik Johansson
Abstract:
In this paper, we discover that the trace of the division of the optimal output estimation error covariance over the noise covariance attained by the Kalman-Bucy filter can be explicitly expressed in terms of the plant dynamics and noise statistics in a frequency-domain integral characterization. Towards this end, we examine the algebraic Riccati equation associated with Kalman-Bucy filtering using analytic function theory and relate it to the Bode integral. Our approach features an alternative, frequency-domain framework for analyzing algebraic Riccati equations and reduces to various existing related results.
Submitted 23 July, 2018;
originally announced July 2018.
-
Adaptive Noise Cancellation Using Deep Cerebellar Model Articulation Controller
Authors:
Yu Tsao,
Hao-Chun Chu,
Shih-Wei Lan,
Shih-Hau Fang,
Junghsi Lee,
Chih-Min Lin
Abstract:
This paper proposes a deep cerebellar model articulation controller (DCMAC) for adaptive noise cancellation (ANC). We expand upon the conventional CMAC by stacking single-layer CMAC models into multiple layers to form a DCMAC model and derive a modified backpropagation training algorithm to learn the DCMAC parameters. Compared with conventional CMAC, the DCMAC can characterize nonlinear transformations more effectively because of its deep structure. Experimental results confirm that the proposed DCMAC model outperforms the CMAC in terms of residual noise in an ANC task, showing that DCMAC provides enhanced modeling capability based on channel characteristics.
Submitted 2 May, 2017;
originally announced May 2017.
-
Three Laws of Multivariable Feedback Systems, Extended Spectral Flatness (Extended Wiener Entropy), 'Uncertainty Principles' in Variance Minimization, and Performance Limitations in Minimum Variance Estimation/Filtering
Authors:
Song Fang
Abstract:
In this paper, three laws are obtained for multiple-input multiple-output feedback systems, in the entropy domain, the frequency domain, and the time domain, respectively. The system setup assumes causal plants and causal controllers. These laws characterize the performance limitations of such systems imposed by the feedback mechanism. Some new notions are proposed to facilitate the analysis: negentropy rate, extended spectral flatness (extended Wiener entropy), Gaussianity-whiteness measure (joint Shannon-Wiener entropy), etc. Two approaches are adopted: the integrated approach and the divided approach. In addition, 'uncertainty principles' are found in minimum variance control, and performance limitations in minimum variance estimation and filtering are obtained. Finally, the special case of linear time-invariant feedback systems is discussed.
Submitted 19 December, 2014; v1 submitted 1 December, 2014;
originally announced December 2014.
-
Limitations of state estimation: absolute lower bound of minimum variance estimation/filtering, Gaussianity-whiteness measure (joint Shannon-Wiener entropy), and Gaussianing-whitening filter (maximum Gaussianity-whiteness measure principle)
Authors:
Song Fang
Abstract:
This paper aims at obtaining performance limitations of state estimation in terms of variance minimization (minimum variance estimation and filtering) using information theory. Two new notions, negentropy rate and Gaussianity-whiteness measure (joint Shannon-Wiener entropy), are proposed to facilitate the analysis. Topics such as Gaussianing-whitening filter (the maximum Gaussianity-whiteness measure principle) are also discussed.
Submitted 19 December, 2014; v1 submitted 4 November, 2014;
originally announced November 2014.