-
Cooperative Bistatic ISAC Systems for Low-Altitude Economy
Authors:
Zhenkun Zhang,
Yining Xu,
Cunhua Pan,
Hong Ren,
Yiming Yu,
Jiangzhou Wang
Abstract:
The burgeoning low-altitude economy (LAE) necessitates integrated sensing and communication (ISAC) systems capable of high-accuracy multi-target localization and velocity estimation under hardware and coverage constraints inherent in conventional ISAC architectures. This paper addresses these challenges by proposing a cooperative bistatic ISAC framework within MIMO-OFDM cellular networks, enabling…
▽ More
The burgeoning low-altitude economy (LAE) necessitates integrated sensing and communication (ISAC) systems capable of high-accuracy multi-target localization and velocity estimation under hardware and coverage constraints inherent in conventional ISAC architectures. This paper addresses these challenges by proposing a cooperative bistatic ISAC framework within MIMO-OFDM cellular networks, enabling robust sensing services for LAE applications through standardized 5G New Radio (NR) infrastructure. We first develop a low-complexity parameter extraction algorithm employing CANDECOMP/PARAFAC (CP) tensor decomposition, which exploits the inherent Vandermonde structure in delay-related factor matrices to efficiently recover bistatic ranges, Doppler velocities, and angles-of-arrival (AoA) from multi-dimensional received signal tensors. To resolve data association ambiguity across distributed transmitter-receiver pairs and mitigate erroneous estimates, we further design a robust fusion scheme based on the minimum spanning tree (MST) method, enabling joint 3D position and velocity reconstruction. Comprehensive simulation results validate the framework's superiority in computational efficiency and sensing performance for low-altitude scenarios.
△ Less
Submitted 22 June, 2025;
originally announced June 2025.
-
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Authors:
Ailin Huang,
Bingxin Li,
Bruce Wang,
Boyong Wu,
Chao Yan,
Chengli Feng,
Heng Wang,
Hongyu Zhou,
Hongyuan Wang,
Jingbei Li,
Jianjian Sun,
Joanna Wang,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Shilei Jiang,
Tian Fei,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Ge,
Zheng Gong,
Zhewei Huang
, et al. (51 additional authors not shown)
Abstract:
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…
▽ More
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks.
△ Less
Submitted 13 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
ADEPT: Adaptive Diffusion Environment for Policy Transfer Sim-to-Real
Authors:
Youwei Yu,
Junhong Xu,
Lantao Liu
Abstract:
Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured environments. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently…
▽ More
Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured environments. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently challenging yet attainable environments to facilitate continuous policy improvement. Existing methods of outdoor environment generation often rely on heuristics constrained by a set of parameters, limiting the diversity and realism. In this work, we introduce ADEPT, a novel \textbf{A}daptive \textbf{D}iffusion \textbf{E}nvironment for \textbf{P}olicy \textbf{T}ransfer in the zero-shot sim-to-real fashion that leverages Denoising Diffusion Probabilistic Models to dynamically expand existing training environments by adding more diverse and complex environments adaptive to the current policy. ADEPT guides the diffusion model's generation process through initial noise optimization, blending noise-corrupted environments from existing training environments weighted by the policy's performance in each corresponding environment. By manipulating the noise corruption level, ADEPT seamlessly transitions between generating similar environments for policy fine-tuning and novel ones to expand training diversity. To benchmark ADEPT in off-road navigation, we propose a fast and effective multi-layer map representation for wild environment generation. Our experiments show that the policy trained by ADEPT outperforms both procedural generated and natural environments, along with popular navigation methods.
△ Less
Submitted 4 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
A physics-guided smoothing method for material modeling with digital image correlation (DIC) measurements
Authors:
Jihong Wang,
Chung-Hao Lee,
William Richardson,
Yue Yu
Abstract:
In this work, we present a novel approach to process the DIC measurements of multiple biaxial stretching protocols. In particular, we develop a optimization-based approach, which calculates the smoothed nodal displacements using a moving least-squares algorithm subject to positive strain constraints. As such, physically consistent displacement and strain fields are obtained. Then, we further deplo…
▽ More
In this work, we present a novel approach to process the DIC measurements of multiple biaxial stretching protocols. In particular, we develop a optimization-based approach, which calculates the smoothed nodal displacements using a moving least-squares algorithm subject to positive strain constraints. As such, physically consistent displacement and strain fields are obtained. Then, we further deploy a data-driven workflow to heterogeneous material modeling from these physically consistent DIC measurements, by estimating a nonlocal constitutive law together with the material microstructure. To demonstrate the applicability of our approach, we apply it in learning a material model and fiber orientation field from DIC measurements of a porcine tricuspid valve anterior leaflet. Our results demonstrate that the proposed DIC data processing approach can significantly improve the accuracy of modeling biological materials.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
Authors:
Jingran Xie,
Xiang Li,
Hui Wang,
Yue Yu,
Yang Xiang,
Xixin Wu,
Zhiyong Wu
Abstract:
Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these iss…
▽ More
Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task 'behavior imitation' method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation
Authors:
Woohyun Cho,
Youngmin Kim,
Sunghyun Lee,
Youngjae Yu
Abstract:
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation.…
▽ More
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought SylAVL-CoT, which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
△ Less
Submitted 5 June, 2025; v1 submitted 24 May, 2025;
originally announced May 2025.
-
An Inclusive Foundation Model for Generalizable Cytogenetics in Precision Oncology
Authors:
Changchun Yang,
Weiqian Dai,
Yilan Zhang,
Siyuan Chen,
Jingdong Hu,
Junkai Su,
Yuxuan Chen,
Ao Xu,
Na Li,
Xin Gao,
Yongguo Yu
Abstract:
Chromosome analysis is vital for diagnosing genetic disorders and guiding cancer therapy decisions through the identification of somatic clonal aberrations. However, developing an AI model are hindered by the overwhelming complexity and diversity of chromosomal abnormalities, requiring extensive annotation efforts, while automated methods remain task-specific and lack generalizability due to the s…
▽ More
Chromosome analysis is vital for diagnosing genetic disorders and guiding cancer therapy decisions through the identification of somatic clonal aberrations. However, developing an AI model are hindered by the overwhelming complexity and diversity of chromosomal abnormalities, requiring extensive annotation efforts, while automated methods remain task-specific and lack generalizability due to the scarcity of comprehensive datasets spanning diverse resource conditions. Here, we introduce CHROMA, a foundation model for cytogenomics, designed to overcome these challenges by learning generalizable representations of chromosomal abnormalities. Pre-trained on over 84,000 specimens (~4 million chromosomal images) via self-supervised learning, CHROMA outperforms other methods across all types of abnormalities, even when trained on fewer labelled data and more imbalanced datasets. By facilitating comprehensive mapping of instability and clonal leisons across various aberration types, CHROMA offers a scalable and generalizable solution for reliable and automated clinical analysis, reducing the annotation workload for experts and advancing precision oncology through the early detection of rare genomic abnormalities, enabling broad clinical AI applications and making advanced genomic analysis more accessible.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Authors:
Yutong Liu,
Ziyue Zhang,
Ban Ma-bao,
Yuqing Cai,
Yongbin Yu,
Renzeng Duojie,
Xiangxiang Wang,
Fan Gao,
Cheng Huang,
Nyima Tashi
Abstract:
Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects-Ü-Tsang, Amdo, and Kham-limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a…
▽ More
Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects-Ü-Tsang, Amdo, and Kham-limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
A Mamba-based Network for Semi-supervised Singing Melody Extraction Using Confidence Binary Regularization
Authors:
Xiaoliang He,
Kangjie Dong,
Jingkai Cao,
Shuai Yu,
Wei Li,
Yi Yu
Abstract:
Singing melody extraction (SME) is a key task in the field of music information retrieval. However, existing methods are facing several limitations: firstly, prior models use transformers to capture the contextual dependencies, which requires quadratic computation resulting in low efficiency in the inference stage. Secondly, prior works typically rely on frequencysupervised methods to estimate the…
▽ More
Singing melody extraction (SME) is a key task in the field of music information retrieval. However, existing methods are facing several limitations: firstly, prior models use transformers to capture the contextual dependencies, which requires quadratic computation resulting in low efficiency in the inference stage. Secondly, prior works typically rely on frequencysupervised methods to estimate the fundamental frequency (f0), which ignores that the musical performance is actually based on notes. Thirdly, transformers typically require large amounts of labeled data to achieve optimal performances, but the SME task lacks of sufficient annotated data. To address these issues, in this paper, we propose a mamba-based network, called SpectMamba, for semi-supervised singing melody extraction using confidence binary regularization. In particular, we begin by introducing vision mamba to achieve computational linear complexity. Then, we propose a novel note-f0 decoder that allows the model to better mimic the musical performance. Further, to alleviate the scarcity of the labeled data, we introduce a confidence binary regularization (CBR) module to leverage the unlabeled data by maximizing the probability of the correct classes. The proposed method is evaluated on several public datasets and the conducted experiments demonstrate the effectiveness of our proposed method.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Efficient Information Updates in Compute-First Networking via Reinforcement Learning with Joint AoI and VoI
Authors:
Jianpeng Qi,
Chao Liu,
Chengxiang Xu,
Rui Wang,
Junyu Dong,
Yanwei Yu
Abstract:
Timely and efficient dissemination of service information is critical in compute-first networking systems, where user requests arrive dynamically and computing resources are constrained. In such systems, the access point (AP) plays a key role in forwarding user requests to a server based on its latest received service information. This paper considers a single-source, single-destination system and…
▽ More
Timely and efficient dissemination of service information is critical in compute-first networking systems, where user requests arrive dynamically and computing resources are constrained. In such systems, the access point (AP) plays a key role in forwarding user requests to a server based on its latest received service information. This paper considers a single-source, single-destination system and introduces an Age-and-Value-Aware (AVA) metric that jointly captures both the timeliness and the task relevance of service information. Unlike traditional freshness-based metrics, AVA explicitly incorporates variations in server-side service capacity and AP forwarding decisions, allowing more context-aware update evaluation. Building upon AVA, we propose a reinforcement learning-based update policy that learns to selectively transmit service information updates to the AP. It aims to maximize overall task success while minimizing unnecessary communications. Extensive simulations under diverse user request patterns and varying service capacities demonstrate that AVA reduces the update frequency by over 90% on average compared to baselines, with reductions reaching 98% in certain configurations. Crucially, this reduction is achieved without compromising the accuracy of task execution or the quality of decision making.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Field-scale soil moisture estimated from Sentinel-1 SAR data using a knowledge-guided deep learning approach
Authors:
Yi Yu,
Patrick Filippi,
Thomas F. A. Bishop
Abstract:
Soil moisture (SM) estimation from active microwave data remains challenging due to the complex interactions between radar backscatter and surface characteristics. While the water cloud model (WCM) provides a semi-physical approach for understanding these interactions, its empirical component often limits performance across diverse agricultural landscapes. This research presents preliminary effort…
▽ More
Soil moisture (SM) estimation from active microwave data remains challenging due to the complex interactions between radar backscatter and surface characteristics. While the water cloud model (WCM) provides a semi-physical approach for understanding these interactions, its empirical component often limits performance across diverse agricultural landscapes. This research presents preliminary efforts for developing a knowledge-guided deep learning approach, which integrates WCM principles into a long short-term memory (LSTM) model, to estimate field SM using Sentinel-1 Synthetic Aperture Radar (SAR) data. Our proposed approach leverages LSTM's capacity to capture spatiotemporal dependencies while maintaining physical consistency through a modified dual-component loss function, including a WCM-based semi-physical component and a boundary condition regularisation. The proposed approach is built upon the soil backscatter coefficients isolated from the total backscatter, together with Landsat-resolution vegetation information and surface characteristics. A four-fold spatial cross-validation was performed against in-situ SM data to assess the model performance. Results showed the proposed approach reduced SM retrieval uncertainties by 0.02 m$^3$/m$^3$ and achieved correlation coefficients (R) of up to 0.64 in areas with varying vegetation cover and surface conditions, demonstrating the potential to address the over-simplification in WCM.
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang
, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…
▽ More
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis
Authors:
Yubing Cao,
Yinfeng Yu,
Yongming Li,
Liejun Wang
Abstract:
This paper presents AMNet, an Acoustic Model Network designed to improve the performance of Mandarin speech synthesis by incorporating phrase structure annotation and local convolution modules. AMNet builds upon the FastSpeech 2 architecture while addressing the challenge of local context modeling, which is crucial for capturing intricate speech features such as pauses, stress, and intonation. By…
▽ More
This paper presents AMNet, an Acoustic Model Network designed to improve the performance of Mandarin speech synthesis by incorporating phrase structure annotation and local convolution modules. AMNet builds upon the FastSpeech 2 architecture while addressing the challenge of local context modeling, which is crucial for capturing intricate speech features such as pauses, stress, and intonation. By embedding a phrase structure parser into the model and introducing a local convolution module, AMNet enhances the model's sensitivity to local information. Additionally, AMNet decouples tonal characteristics from phonemes, providing explicit guidance for tone modeling, which improves tone accuracy and pronunciation. Experimental results demonstrate that AMNet outperforms baseline models in subjective and objective evaluations. The proposed model achieves superior Mean Opinion Scores (MOS), lower Mel Cepstral Distortion (MCD), and improved fundamental frequency fitting $F0 (R^2)$, confirming its ability to generate high-quality, natural, and expressive Mandarin speech.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Neural Fidelity Calibration for Informative Sim-to-Real Adaptation
Authors:
Youwei Yu,
Lantao Liu
Abstract:
Deep reinforcement learning can seamlessly transfer agile locomotion and navigation skills from the simulator to real world. However, bridging the sim-to-real gap with domain randomization or adversarial methods often demands expert physics knowledge to ensure policy robustness. Even so, cutting-edge simulators may fall short of capturing every real-world detail, and the reconstructed environment…
▽ More
Deep reinforcement learning can seamlessly transfer agile locomotion and navigation skills from the simulator to real world. However, bridging the sim-to-real gap with domain randomization or adversarial methods often demands expert physics knowledge to ensure policy robustness. Even so, cutting-edge simulators may fall short of capturing every real-world detail, and the reconstructed environment may introduce errors due to various perception uncertainties. To address these challenges, we propose Neural Fidelity Calibration (NFC), a novel framework that employs conditional score-based diffusion models to calibrate simulator physical coefficients and residual fidelity domains online during robot execution. Specifically, the residual fidelity reflects the simulation model shift relative to the real-world dynamics and captures the uncertainty of the perceived environment, enabling us to sample realistic environments under the inferred distribution for policy fine-tuning. Our framework is informative and adaptive in three key ways: (a) we fine-tune the pretrained policy only under anomalous scenarios, (b) we build sequential NFC online with the pretrained NFC's proposal prior, reducing the diffusion model's training burden, and (c) when NFC uncertainty is high and may degrade policy improvement, we leverage optimistic exploration to enable hallucinated policy optimization. Our framework achieves superior simulator calibration precision compared to state-of-the-art methods across diverse robots with high-dimensional parametric spaces. We study the critical contribution of residual fidelity to policy improvement in simulation and real-world experiments. Notably, our approach demonstrates robust robot navigation under challenging real-world conditions, such as a broken wheel axle on snowy surfaces.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
Leveraging Label Potential for Enhanced Multimodal Emotion Recognition
Authors:
Xuechun Shao,
Yinfeng Yu,
Liejun Wang
Abstract:
Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately. However, most current research focuses solely on the fusion of audio and text features, overlooking the valuable information in emotion labels. This oversight could potentially hinder the performance of existing methods, as emotion labels harbor rich, insightful information that could…
▽ More
Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately. However, most current research focuses solely on the fusion of audio and text features, overlooking the valuable information in emotion labels. This oversight could potentially hinder the performance of existing methods, as emotion labels harbor rich, insightful information that could significantly aid MER. We introduce a novel model called Label Signal-Guided Multimodal Emotion Recognition (LSGMER) to overcome this limitation. This model aims to fully harness the power of emotion label information to boost the classification accuracy and stability of MER. Specifically, LSGMER employs a Label Signal Enhancement module that optimizes the representation of modality features by interacting with audio and text features through label embeddings, enabling it to capture the nuances of emotions precisely. Furthermore, we propose a Joint Objective Optimization(JOO) approach to enhance classification accuracy by introducing the Attribution-Prediction Consistency Constraint (APC), which strengthens the alignment between fused features and emotion categories. Extensive experiments conducted on the IEMOCAP and MELD datasets have demonstrated the effectiveness of our proposed LSGMER model.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Multi-Agent Deep Reinforcement Learning for Multiple Anesthetics Collaborative Control
Authors:
Huijie Li,
Yide Yu,
Si Shi,
Anmin Hu,
Jian Huo,
Wei Lin,
Chaoran Wu,
Wuman Luo
Abstract:
Automated control of personalized multiple anesthetics in clinical Total Intravenous Anesthesia (TIVA) is crucial yet challenging. Current systems, including target-controlled infusion (TCI) and closed-loop systems, either rely on relatively static pharmacokinetic/pharmacodynamic (PK/PD) models or focus on single anesthetic control, limiting personalization and collaborative control. To address th…
▽ More
Automated control of personalized multiple anesthetics in clinical Total Intravenous Anesthesia (TIVA) is crucial yet challenging. Current systems, including target-controlled infusion (TCI) and closed-loop systems, either rely on relatively static pharmacokinetic/pharmacodynamic (PK/PD) models or focus on single anesthetic control, limiting personalization and collaborative control. To address these issues, we propose a novel framework, Value Decomposition Multi-Agent Deep Reinforcement Learning (VD-MADRL). VD-MADRL optimizes the collaboration between two anesthetics propofol (Agent I) and remifentanil (Agent II). And It uses a Markov Game (MG) to identify optimal actions among heterogeneous agents. We employ various value function decomposition methods to resolve the credit allocation problem and enhance collaborative control. We also introduce a multivariate environment model based on random forest (RF) for anesthesia state simulation. Additionally, a data resampling and alignment technique ensures synchronized trajectory data. Our experiments on general and thoracic surgery datasets show that VD-MADRL performs better than human experience. It improves dose precision and keeps anesthesia states stable, providing great clinical value.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Process Optimization and Deployment for Sensor-Based Human Activity Recognition Based on Deep Learning
Authors:
Hanyu Liu,
Ying Yu,
Hang Xiao,
Siyao Li,
Xuze Li,
Jiarui Li,
Haotian Tang
Abstract:
Sensor-based human activity recognition is a key technology for many human-centered intelligent applications. However, this research is still in its infancy and faces many unresolved challenges. To address these, we propose a comprehensive optimization process approach centered on multi-attention interaction. We first utilize unsupervised statistical feature-guided diffusion models for highly adap…
▽ More
Sensor-based human activity recognition is a key technology for many human-centered intelligent applications. However, this research is still in its infancy and faces many unresolved challenges. To address these, we propose a comprehensive optimization process approach centered on multi-attention interaction. We first utilize unsupervised statistical feature-guided diffusion models for highly adaptive data enhancement, and introduce a novel network architecture-Multi-branch Spatiotemporal Interaction Network, which uses multi-branch features at different levels to effectively Sequential ), which uses multi-branch features at different levels to effectively Sequential spatio-temporal interaction to enhance the ability to mine advanced latent features. In addition, we adopt a multi-loss function fusion strategy in the training phase to dynamically adjust the fusion weights between batches to optimize the training results. Finally, we also conducted actual deployment on embedded devices to extensively test the practical feasibility of the proposed method in existing work. We conduct extensive testing on three public datasets, including ablation studies, comparisons of related work, and embedded deployments.
△ Less
Submitted 22 March, 2025;
originally announced April 2025.
-
SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System
Authors:
Hyeongju Kim,
Jinhyeok Yang,
Yechan Yu,
Seunghun Ji,
Jacob Morton,
Frederik Bous,
Joon Byun,
Juheon Lee
Abstract:
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ…
▽ More
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: https://supertonictts.github.io/.
△ Less
Submitted 16 May, 2025; v1 submitted 29 March, 2025;
originally announced March 2025.
-
Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting
Authors:
Alimjan Mattursun,
Liejun Wang,
Yinfeng Yu,
Chunyang Ma
Abstract:
Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the mag…
▽ More
Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The implementation of the BSP-MPNet code is available online\footnote[2]{https://github.com/AlimMat/BSP-MPNet. \label{s1}}
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Semi-Gradient SARSA Routing with Theoretical Guarantee on Traffic Stability and Weight Convergence
Authors:
Yidan Wu,
Yu Yu,
Jianan Zhang,
Li Jin
Abstract:
We consider the traffic control problem of dynamic routing over parallel servers, which arises in a variety of engineering systems such as transportation and data transmission. We propose a semi-gradient, on-policy algorithm that learns an approximate optimal routing policy. The algorithm uses generic basis functions with flexible weights to approximate the value function across the unbounded stat…
▽ More
We consider the traffic control problem of dynamic routing over parallel servers, which arises in a variety of engineering systems such as transportation and data transmission. We propose a semi-gradient, on-policy algorithm that learns an approximate optimal routing policy. The algorithm uses generic basis functions with flexible weights to approximate the value function across the unbounded state space. Consequently, the training process lacks Lipschitz continuity of the gradient, boundedness of the temporal-difference error, and a prior guarantee on ergodicity, which are the standard prerequisites in existing literature on reinforcement learning theory. To address this, we combine a Lyapunov approach and an ordinary differential equation-based method to jointly characterize the behavior of traffic state and approximation weights. Our theoretical analysis proves that the training scheme guarantees traffic state stability and ensures almost surely convergence of the weights to the approximate optimum. We also demonstrate via simulations that our algorithm attains significantly faster convergence than neural network-based methods with an insignificant approximation error.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Submillimeter-Accurate 3D Lumbar Spine Reconstruction from Biplanar X-Ray Images: Incorporating a Multi-Task Network and Landmark-Weighted Loss
Authors:
Wanxin Yu,
Zhemin Zhu,
Cong Wang,
Yihang Bao,
Chunjie Xia,
Rongshan Cheng,
Yan Yu,
Tsung-Yuan Tsai
Abstract:
Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods only achieve millimeter-level accuracy, making it difficult to meet clinical standards. This study developed and validated a fully automated method for high-accurac…
▽ More
Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods only achieve millimeter-level accuracy, making it difficult to meet clinical standards. This study developed and validated a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images. The method involves lumbar decomposition and landmark detection from the raw X-ray images, followed by a deformable model and landmark-weighted 2D-3D registration approach. The reconstruction accuracy was validated by the gold standard obtained through the registration of CT-segmented vertebral models with the biplanar X-ray images. The proposed method achieved a 3D reconstruction accuracy of 0.80mm, representing a significant improvement over the mainstream approaches. This study will contribute to the clinical diagnosis of lumbar in weight-bearing positions.
△ Less
Submitted 18 May, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Electromagnetic Information Theory: Fundamentals, Paradigm Shifts, and Applications
Authors:
Tengjiao Wang,
Zhenyu Kang,
Ting Li,
Zhihui Chen,
Shaobo Wang,
Yingpei Lin,
Yan Wang,
Yichuan Yu
Abstract:
This paper explores the emerging research direction of electromagnetic information theory (EIT), which aims to integrate traditional Shannon-based methodologies with physical consistency, particularly the electromagnetic properties of communication channels. We propose an EIT-based multiple-input multiple-output (MIMO) paradigm that enhances conventional spatially-discrete MIMO models by incorpora…
▽ More
This paper explores the emerging research direction of electromagnetic information theory (EIT), which aims to integrate traditional Shannon-based methodologies with physical consistency, particularly the electromagnetic properties of communication channels. We propose an EIT-based multiple-input multiple-output (MIMO) paradigm that enhances conventional spatially-discrete MIMO models by incorporating the concepts of electromagnetic (EM) precoding and EM combining. This approach aims to improve the modeling of next-generation systems while remaining consistent with Shannon's theoretical foundations. We explore typical EIT applications, such as densely spaced MIMO, near-field communications, and tri-polarized antennas, and analyze their channel characteristics through theoretical simulations and measured datasets. The paper also discusses critical research challenges and opportunities for EIT applications from an industrial perspective, emphasizing the field's potential for practical applications.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
When Large Language Models Meet Speech: A Survey on Integration Approaches
Authors:
Zhengdong Yang,
Shuichiro Shimizu,
Yahan Yu,
Chenhui Chu
Abstract:
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based,…
▽ More
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for
△ Less
Submitted 26 February, 2025;
originally announced February 2025.
-
Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions
Authors:
Ji Yin,
Oswin So,
Eric Yang Yu,
Chuchu Fan,
Panagiotis Tsiotras
Abstract:
A common problem when using model predictive control (MPC) in practice is the satisfaction of safety specifications beyond the prediction horizon. While theoretical works have shown that safety can be guaranteed by enforcing a suitable terminal set constraint or a sufficiently long prediction horizon, these techniques are difficult to apply and thus are rarely used by practitioners, especially in…
▽ More
A common problem when using model predictive control (MPC) in practice is the satisfaction of safety specifications beyond the prediction horizon. While theoretical works have shown that safety can be guaranteed by enforcing a suitable terminal set constraint or a sufficiently long prediction horizon, these techniques are difficult to apply and thus are rarely used by practitioners, especially in the case of general nonlinear dynamics. To solve this problem, we impose a tradeoff between exact recursive feasibility, computational tractability, and applicability to ``black-box'' dynamics by learning an approximate discrete-time control barrier function and incorporating it into a variational inference MPC (VIMPC), a sampling-based MPC paradigm. To handle the resulting state constraints, we further propose a new sampling strategy that greatly reduces the variance of the estimated optimal control, improving the sample efficiency, and enabling real-time planning on a CPU. The resulting Neural Shield-VIMPC (NS-VIMPC) controller yields substantial safety improvements compared to existing sampling-based MPC controllers, even under badly designed cost functions. We validate our approach in both simulation and real-world hardware experiments. Project website: https://mit-realm.github.io/ns-vimpc/.
△ Less
Submitted 8 July, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu
, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…
▽ More
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
△ Less
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data
Authors:
Jingran Xie,
Shun Lei,
Yue Yu,
Yang Xiang,
Hui Wang,
Xixin Wu,
Zhiyong Wu
Abstract:
Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities and shown its potential in multimodal domains. Many studies have…
▽ More
Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities and shown its potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take speech question as input and output text response. However, the lack of spoken question-answering datasets that include speech style information to supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process, initially guiding the LLM to listen the content and perceive the emotional aspects of speech. Subsequently, we utilize Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on listened spoken content and perceived emotional cues. We employ experiments to prove the effectiveness of proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.
△ Less
Submitted 18 January, 2025;
originally announced January 2025.
-
SynthSoM: A synthetic intelligent multi-modal sensing-communication dataset for Synesthesia of Machines (SoM)
Authors:
Xiang Cheng,
Ziwei Huang,
Yong Yu,
Lu Bai,
Mingran Sun,
Zengrui Han,
Ruide Zhang,
Sijiang Li
Abstract:
Given the importance of datasets for sensing-communication integration research, a novel simulation platform for constructing communication and multi-modal sensory dataset is developed. The developed platform integrates three high-precision software, i.e., AirSim, WaveFarer, and Wireless InSite, and further achieves in-depth integration and precise alignment of them. Based on the developed platfor…
▽ More
Given the importance of datasets for sensing-communication integration research, a novel simulation platform for constructing communication and multi-modal sensory dataset is developed. The developed platform integrates three high-precision software, i.e., AirSim, WaveFarer, and Wireless InSite, and further achieves in-depth integration and precise alignment of them. Based on the developed platform, a new synthetic intelligent multi-modal sensing-communication dataset for Synesthesia of Machines (SoM), named SynthSoM, is proposed. The SynthSoM dataset contains various air-ground multi-link cooperative scenarios with comprehensive conditions, including multiple weather conditions, times of the day, intelligent agent densities, frequency bands, and antenna types. The SynthSoM dataset encompasses multiple data modalities, including radio-frequency (RF) channel large-scale and small-scale fading data, RF millimeter wave (mmWave) radar sensory data, and non-RF sensory data, e.g., RGB images, depth maps, and light detection and ranging (LiDAR) point clouds. The quality of SynthSoM dataset is validated via statistics-based qualitative inspection and evaluation metrics through machine learning (ML) via real-world measurements. The SynthSoM dataset is open-sourced and provides consistent data for cross-comparing SoM-related algorithms.
△ Less
Submitted 24 April, 2025; v1 submitted 13 January, 2025;
originally announced January 2025.
-
Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis
Authors:
Xincheng Wang,
Liejun Wang,
Yinfeng Yu,
Xinxin Jiao
Abstract:
Multimodal Sentiment Analysis (MSA) integrates diverse modalities(text, audio, and video) to comprehensively analyze and understand individuals' emotional states. However, the real-world prevalence of incomplete data poses significant challenges to MSA, mainly due to the randomness of modality missing. Moreover, the heterogeneity issue in multimodal data has yet to be effectively addressed. To tac…
▽ More
Multimodal Sentiment Analysis (MSA) integrates diverse modalities(text, audio, and video) to comprehensively analyze and understand individuals' emotional states. However, the real-world prevalence of incomplete data poses significant challenges to MSA, mainly due to the randomness of modality missing. Moreover, the heterogeneity issue in multimodal data has yet to be effectively addressed. To tackle these challenges, we introduce the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet) for Missing Multimodal Sentiment Analysis. MITR-DNet employs a distillation approach, wherein a complete modality teacher model guides a missing modality student model, ensuring robustness in the presence of modality missing. Simultaneously, we developed the Modality-Invariant Bidirectional Temporal Representation Learning Module (MIB-TRL) to mitigate heterogeneity.
△ Less
Submitted 7 January, 2025;
originally announced January 2025.
-
Cooperative ISAC-empowered Low-Altitude Economy
Authors:
Jun Tang,
Yiming Yu,
Cunhua Pan,
Hong Ren,
Dongming Wang,
Jiangzhou Wang,
Xiaohu You
Abstract:
This paper proposes a cooperative integrated sensing and communication (ISAC) scheme for the low-altitude sensing scenario, aiming at estimating the parameters of the unmanned aerial vehicles (UAVs) and enhancing the sensing performance via cooperation. The proposed scheme consists of two stages. In Stage I, we formulate the monostatic parameter estimation problem via using a tensor decomposition…
▽ More
This paper proposes a cooperative integrated sensing and communication (ISAC) scheme for the low-altitude sensing scenario, aiming at estimating the parameters of the unmanned aerial vehicles (UAVs) and enhancing the sensing performance via cooperation. The proposed scheme consists of two stages. In Stage I, we formulate the monostatic parameter estimation problem via using a tensor decomposition model. By leveraging the Vandermonde structure of the factor matrix, a spatial smoothing tensor decomposition scheme is introduced to estimate the UAVs' parameters. To further reduce the computational complexity, we design a reduced-dimensional (RD) angle of arrival (AoA) estimation algorithm based on generalized Rayleigh quotient (GRQ). In Stage II, the positions and true velocities of the UAVs are determined through the data fusion across multiple base stations (BSs). Specifically, we first develop a false removing minimum spanning tree (MST)-based data association method to accurately match the BSs' parameter estimations to the same UAV. Then, a Pareto optimality method and a residual weighting scheme are developed to facilitate the position and velocity estimation, respectively. We further extend our approach to the dual-polarized system. Simulation results validate the effectiveness of the proposed schemes in comparison to the conventional techniques.
△ Less
Submitted 29 December, 2024;
originally announced December 2024.
-
MobileNetV2: A lightweight classification model for home-based sleep apnea screening
Authors:
Hui Pan,
Yanxuan Yu,
Jilun Ye,
Xu Zhang
Abstract:
This study proposes a novel lightweight neural network model leveraging features extracted from electrocardiogram (ECG) and respiratory signals for early OSA screening. ECG signals are used to generate feature spectrograms to predict sleep stages, while respiratory signals are employed to detect sleep-related breathing abnormalities. By integrating these predictions, the method calculates the apne…
▽ More
This study proposes a novel lightweight neural network model leveraging features extracted from electrocardiogram (ECG) and respiratory signals for early OSA screening. ECG signals are used to generate feature spectrograms to predict sleep stages, while respiratory signals are employed to detect sleep-related breathing abnormalities. By integrating these predictions, the method calculates the apnea-hypopnea index (AHI) with enhanced accuracy, facilitating precise OSA diagnosis.
The method was validated on three publicly available sleep apnea databases: the Apnea-ECG database, the UCDDB dataset, and the MIT-BIH Polysomnographic database. Results showed an overall OSA detection accuracy of 0.978, highlighting the model's robustness. Respiratory event classification achieved an accuracy of 0.969 and an area under the receiver operating characteristic curve (ROC-AUC) of 0.98. For sleep stage classification, in UCDDB dataset, the ROC-AUC exceeded 0.85 across all stages, with recall for Sleep reaching 0.906 and specificity for REM and Wake states at 0.956 and 0.937, respectively.
This study underscores the potential of integrating lightweight neural networks with multi-signal analysis for accurate, portable, and cost-effective OSA screening, paving the way for broader adoption in home-based and wearable health monitoring systems.
△ Less
Submitted 3 January, 2025; v1 submitted 27 December, 2024;
originally announced December 2024.
-
A Model-free Biomimetics Algorithm for Deterministic Partially Observable Markov Decision Process
Authors:
Yide Yu,
Yue Liu,
Xiaochen Yuan,
Dennis Wong,
Huijie Li,
Yan Ma
Abstract:
Partially Observable Markov Decision Process (POMDP) is a mathematical framework for modeling decision-making under uncertainty, where the agent's observations are incomplete and the underlying system dynamics are probabilistic. Solving the POMDP problem within the model-free paradigm is challenging for agents due to the inherent difficulty in accurately identifying and distinguishing between stat…
▽ More
Partially Observable Markov Decision Process (POMDP) is a mathematical framework for modeling decision-making under uncertainty, where the agent's observations are incomplete and the underlying system dynamics are probabilistic. Solving the POMDP problem within the model-free paradigm is challenging for agents due to the inherent difficulty in accurately identifying and distinguishing between states and observations. We define such a difficult problem as a DETerministic Partially Observable Markov Decision Process (DET-POMDP) problem, which is a specific setting of POMDP. In this problem, states and observations are in a many-to-one relationship. The state is obscured, and its relationship is less apparent to the agent. This creates obstacles for the agent to infer the state through observations. To effectively address this problem, we convert DET-POMDP into a fully observable MDP using a model-free biomimetics algorithm called BIOMAP. BIOMAP is based on the MDP Graph Automaton framework to distinguish authentic environmental information from fraudulent data. Thus, it enhances the agent's ability to develop stable policies against DET-POMDP. The experimental results highlight the superior capabilities of BIOMAP in maintaining operational effectiveness and environmental reparability in the presence of environmental deceptions when compared with existing POMDP solvers. This research opens up new avenues for the deployment of reliable POMDP-based systems in fields that are particularly susceptible to DET-POMDP problems.
△ Less
Submitted 19 December, 2024;
originally announced December 2024.
-
Enhanced MRI Representation via Cross-series Masking
Authors:
Churan Wang,
Fei Gao,
Lijun Yan,
Siwen Wang,
Yizhou Yu,
Yizhou Wang
Abstract:
Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning treatment in various medical conditions due to its ability to produce multi-series images that reveal different tissue characteristics. However, integrating these diverse series to form a coherent analysis presents significant challenges, such as differing spatial resolutions and contrast patterns meanwhile requiring ext…
▽ More
Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning treatment in various medical conditions due to its ability to produce multi-series images that reveal different tissue characteristics. However, integrating these diverse series to form a coherent analysis presents significant challenges, such as differing spatial resolutions and contrast patterns meanwhile requiring extensive annotated data, which is scarce in clinical practice. Due to these issues, we introduce a novel Cross-Series Masking (CSM) Strategy for effectively learning MRI representation in a self-supervised manner. Specifically, CSM commences by randomly sampling a subset of regions and series, which are then strategically masked. In the training process, the cross-series representation is learned by utilizing the unmasked data to reconstruct the masked portions. This process not only integrates information across different series but also facilitates the ability to model both intra-series and inter-series correlations and complementarities. With the learned representation, the downstream tasks like segmentation and classification are also enhanced. Taking brain tissue segmentation, breast tumor benign/malignant classification, and prostate cancer diagnosis as examples, our method achieves state-of-the-art performance on both public and in-house datasets.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise
Authors:
Yeonguk Yu,
Minhwan Ko,
Sungho Shin,
Kangmin Kim,
Kyoobin Lee
Abstract:
Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that…
▽ More
Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, thus robustly classifying the training samples. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMnist, and OrgancMnist datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.
△ Less
Submitted 29 November, 2024;
originally announced December 2024.
-
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Authors:
Luis Vilaca,
Yi Yu,
Paula Vinan
Abstract:
Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of Deep Learning propelled the development of proposals that process audio-visual data and can be observed in the number of proposals in the past years. Thus encouraging the development of a comprehensive survey. Besides analyzing the models used in this context, we al…
▽ More
Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of Deep Learning propelled the development of proposals that process audio-visual data and can be observed in the number of proposals in the past years. Thus encouraging the development of a comprehensive survey. Besides analyzing the models used in this context, we also discuss some tasks of definition and paradigm applied in AI multimedia. In addition, we investigate objective functions frequently used and discuss how audio-visual data is exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In fact, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible knowledge, can guide the learning process. Most importantly, we provide a summarization of the recent progress of Audio-Visual Correlation Learning (AVCL) and discuss the future research directions.
△ Less
Submitted 23 November, 2024;
originally announced December 2024.
-
Robust and Constrained Estimation of State-Space Models: A Majorization-Minimization Approach
Authors:
Yifan Yu,
Shengjie Xiu,
Daniel P. Palomar
Abstract:
In this paper, we present a novel optimization algorithm designed specifically for estimating state-space models to deal with heavy-tailed measurement noise and constraints. Our algorithm addresses two significant limitations found in existing approaches: susceptibility to measurement noise outliers and difficulties in incorporating constraints into state estimation. By formulating constrained sta…
▽ More
In this paper, we present a novel optimization algorithm designed specifically for estimating state-space models to deal with heavy-tailed measurement noise and constraints. Our algorithm addresses two significant limitations found in existing approaches: susceptibility to measurement noise outliers and difficulties in incorporating constraints into state estimation. By formulating constrained state estimation as an optimization problem and employing the Majorization-Minimization (MM) approach, our framework provides a unified solution that enhances the robustness of the Kalman filter. Experimental results demonstrate high accuracy and computational efficiency achieved by our proposed approach, establishing it as a promising solution for robust and constrained state estimation in real-world applications.
△ Less
Submitted 18 November, 2024;
originally announced November 2024.
-
Sensing Capacity for Integrated Sensing and Communication Systems in Low-Altitude Economy
Authors:
Jiahua Wan,
Hong Ren,
Cunhua Pan,
Zhenkun Zhang,
Songtao Gao,
Yiming Yu,
Chengzhong Wang
Abstract:
The burgeoning significance of the low-altitude economy (LAE) has garnered considerable interest, largely fuelled by the widespread deployment of unmanned aerial vehicles (UAVs). To tackle the challenges associated with the detection of unauthorized UAVs and the efficient scheduling of authorized UAVs, this letter introduces a novel performance metric, termed sensing capacity, for integrated sensi…
▽ More
The burgeoning significance of the low-altitude economy (LAE) has garnered considerable interest, largely fuelled by the widespread deployment of unmanned aerial vehicles (UAVs). To tackle the challenges associated with the detection of unauthorized UAVs and the efficient scheduling of authorized UAVs, this letter introduces a novel performance metric, termed sensing capacity, for integrated sensing and communication (ISAC) systems. This metric, which quantifies the capability of a base station (BS) to detect multiple UAVs simultaneously, leverages signal-to-noise ratio (SNR) and probability of detection (PD) as key intermediate variables. Through mathematical derivations, we can derive a closed-form solution for the maximum number of UAVs that can be detected by the BS while adhering to a specific SNR constraint. Furthermore, an approximate solution based on PD constraints is proposed to facilitate the efficient determination of the threshold for the maximum number of detectable UAVs. The accuracy of this analytical approach is verified through extensive simulation results.
△ Less
Submitted 11 November, 2024;
originally announced November 2024.
-
Wireless-Friendly Window Position Optimization for RIS-Aided Outdoor-to-Indoor Networks based on Multi-Modal Large Language Model
Authors:
Jinbo Hou,
Kehai Qiu,
Zitian Zhang,
Yong Yu,
Kezhi Wang,
Stefano Capolongo,
Jiliang Zhang,
Zeyang Li,
Jie Zhang
Abstract:
This paper aims to simultaneously optimize indoor wireless and daylight performance by adjusting the positions of windows and the beam directions of window-deployed reconfigurable intelligent surfaces (RISs) for RIS-aided outdoor-to-indoor (O2I) networks utilizing large language models (LLM) as optimizers. Firstly, we illustrate the wireless and daylight system models of RIS-aided O2I networks and…
▽ More
This paper aims to simultaneously optimize indoor wireless and daylight performance by adjusting the positions of windows and the beam directions of window-deployed reconfigurable intelligent surfaces (RISs) for RIS-aided outdoor-to-indoor (O2I) networks utilizing large language models (LLM) as optimizers. Firstly, we illustrate the wireless and daylight system models of RIS-aided O2I networks and formulate a joint optimization problem to enhance both wireless traffic sum rate and daylight illumination performance. Then, we present a multi-modal LLM-based window optimization (LMWO) framework, accompanied by a prompt construction template to optimize the overall performance in a zero-shot fashion, functioning as both an architect and a wireless network planner. Finally, we analyze the optimization performance of the LMWO framework and the impact of the number of windows, room size, number of RIS units, and daylight factor. Numerical results demonstrate that our proposed LMWO framework can achieve outstanding optimization performance in terms of initial performance, convergence speed, final outcomes, and time complexity, compared with classic optimization methods. The building's wireless performance can be significantly enhanced while ensuring indoor daylight performance.
△ Less
Submitted 20 June, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features
Authors:
Xin Wei,
Yaling Tao,
Changde Du,
Gangming Zhao,
Yizhou Yu,
Jinpeng Li
Abstract:
Mammography is the primary imaging tool for breast cancer diagnosis. Despite significant strides in applying deep learning to interpret mammography images, efforts that focus predominantly on visual features often struggle with generalization across datasets. We hypothesize that integrating additional modalities in the radiology practice, notably the linguistic features of reports and manifestatio…
▽ More
Mammography is the primary imaging tool for breast cancer diagnosis. Despite significant strides in applying deep learning to interpret mammography images, efforts that focus predominantly on visual features often struggle with generalization across datasets. We hypothesize that integrating additional modalities in the radiology practice, notably the linguistic features of reports and manifestation features embodying radiological insights, offers a more powerful, interpretable and generalizable representation. In this paper, we announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports. Based on this dataset, we focus on the challanging task of unsupervised pretraining and propose ViKL, a innovative framework that synergizes Visual, Knowledge, and Linguistic features. This framework relies solely on pairing information without the necessity for pathology labels, which are often challanging to acquire. ViKL employs a triple contrastive learning approach to merge linguistic and knowledge-based insights with visual data, enabling both inter-modality and intra-modality feature enhancement. Our research yields significant findings: 1) Integrating reports and manifestations with unsupervised visual pretraining, ViKL substantially enhances the pathological classification and fosters multimodal interactions. 2) Manifestations can introduce a novel hard negative sample selection mechanism. 3) The multimodal features demonstrate transferability across different datasets. 4) The multimodal pretraining approach curbs miscalibrations and crafts a high-quality representation space. The MVKL dataset and ViKL code are publicly available at https://github.com/wxwxwwxxx/ViKL to support a broad spectrum of future research.
△ Less
Submitted 24 September, 2024;
originally announced September 2024.
-
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Authors:
Sijing Chen,
Yuan Feng,
Laipeng He,
Tianwei He,
Wendi He,
Yanni Hu,
Bin Lin,
Yiting Lin,
Yu Pan,
Pengfei Tan,
Chengwei Tian,
Chen Wang,
Zhicheng Wang,
Ruoye Xie,
Jixun Yao,
Quanlei Yan,
Yuguang Yang,
Jianhao Ye,
Jingjing Yin,
Yanzhen Yu,
Huimin Zhang,
Xiang Zhang,
Guangcheng Zhao,
Hongbin Zhou,
Pengpeng Zou
Abstract:
With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-…
▽ More
With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and facilitating individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model that builds upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot way. For Takin VC, we advocate an effective content and timbre joint modeling approach to improve the speaker similarity, while advocating for a conditional flow matching based decoder to further enhance its naturalness and expressiveness. Last, we propose the Takin Morphing system with highly decoupled and advanced timbre and prosody modeling approaches, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series models. For detailed demos, please refer to https://everest-ai.github.io/takinaudiollm/.
△ Less
Submitted 23 September, 2024; v1 submitted 18 September, 2024;
originally announced September 2024.
-
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm
Authors:
Yuning Wu,
Jiatong Shi,
Yifeng Yu,
Yuxun Tang,
Tao Qian,
Yueqian Lin,
Jionghao Han,
Xinyi Bai,
Shinji Watanabe,
Qin Jin
Abstract:
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format in…
▽ More
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module to imitate human subjective evaluating scores. Muskits-ESPnet is available at \url{https://github.com/espnet/espnet}.
△ Less
Submitted 10 October, 2024; v1 submitted 11 September, 2024;
originally announced September 2024.
-
A Centralized Discovery-Based Method for Integrating Data Distribution Service and Time-Sensitive Networking in In-Vehicle Networks
Authors:
Feng Luo,
Yi Ren,
Yanhua Yu,
Yunpeng Li,
Zitong Wang
Abstract:
As the electronic and electrical architecture (E/EA) of intelligent and connected vehicles (ICVs) evolves, traditional distributed and signal-oriented architectures are being replaced by centralized, service-oriented architectures (SOA). This new generation of E/EA demands in-vehicle networks (IVNs) that offer high bandwidth, real-time, reliability, and service-oriented. data distribution service…
▽ More
As the electronic and electrical architecture (E/EA) of intelligent and connected vehicles (ICVs) evolves, traditional distributed and signal-oriented architectures are being replaced by centralized, service-oriented architectures (SOA). This new generation of E/EA demands in-vehicle networks (IVNs) that offer high bandwidth, real-time, reliability, and service-oriented. data distribution service (DDS) and time-sensitive networking (TSN) are increasingly adopted to address these requirements. However, research on the integrated deployment of DDS and TSN in automotive applications is still in its infancy. This paper presents a DDS over TSN (DoT) communication architecture based on the centralized discovery architecture (CDA). First, a lightweight DDS implementation (FastDDS-lw) is developed for resource-constrained in-vehicle devices. Next, a DDS flow identification algorithm (DFIA) based on the CDA is introduced to identify potential DDS flows during the discovery phase automatically. Finally, the DoT communication architecture is designed, incorporating FastDDS-lw and DFIA. Experimental results show that the DoT architecture significantly reduces end-to-end latency and jitter for critical DDS flows compared to traditional Ethernet. Additionally, DoT provides an automated network configuration method that completes within a few tens of milliseconds.
△ Less
Submitted 6 September, 2024;
originally announced September 2024.
-
Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement
Authors:
Tao Zheng,
Liejun Wang,
Yinfeng Yu
Abstract:
Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introd…
▽ More
Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introduces a novel speech enhancement framework, HFSDA, which skillfully integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism to significantly enhance speech clarity and quality in noisy environments. By leveraging self-supervised learning embeddings in tandem with Short-Time Fourier Transform (STFT) spectrogram features, our model excels at capturing both high-level semantic information and detailed spectral data, enabling a more thorough analysis and refinement of speech signals. Furthermore, we employ the innovative Omni-dimensional Dynamic Convolution (ODConv) technology within the spectrogram input branch, enabling enhanced extraction and integration of crucial information across multiple dimensions. Additionally, we refine the Conformer model by enhancing its feature extraction capabilities not only in the temporal dimension but also across the spectral domain. Extensive experiments on the VCTK-DEMAND dataset show that HFSDA is comparable to existing state-of-the-art models, confirming the validity of our approach.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders
Authors:
Yubing Cao,
Yongming Li,
Liejun Wang,
Yinfeng Yu
Abstract:
Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-…
▽ More
Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-limited spectral information, which inevitably sacrifices high-frequency details. To address this, we adopt the full-band Mel spectrogram information as input, aiming to provide the vocoder with the most comprehensive information possible. However, previous studies have revealed that the use of full-band spectral information as input can result in the issue of over-smoothing, compromising the naturalness of the synthesized speech. To tackle this challenge, we propose VNet, a GAN-based neural vocoder network that incorporates full-band spectral information and introduces a Multi-Tier Discriminator (MTD) comprising multiple sub-discriminators to generate high-resolution signals. Additionally, we introduce an asymptotically constrained method that modifies the adversarial loss of the generator and discriminator, enhancing the stability of the training process. Through rigorous experiments, we demonstrate that the VNet model is capable of generating high-fidelity speech and significantly improving the performance of the vocoder.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
BSS-CFFMA: Cross-Domain Feature Fusion and Multi-Attention Speech Enhancement Network based on Self-Supervised Embedding
Authors:
Alimjan Mattursun,
Liejun Wang,
Yinfeng Yu
Abstract:
Speech self-supervised learning (SSL) represents has achieved state-of-the-art (SOTA) performance in multiple downstream tasks. However, its application in speech enhancement (SE) tasks remains immature, offering opportunities for improvement. In this study, we introduce a novel cross-domain feature fusion and multi-attention speech enhancement network, termed BSS-CFFMA, which leverages self-super…
▽ More
Speech self-supervised learning (SSL) represents has achieved state-of-the-art (SOTA) performance in multiple downstream tasks. However, its application in speech enhancement (SE) tasks remains immature, offering opportunities for improvement. In this study, we introduce a novel cross-domain feature fusion and multi-attention speech enhancement network, termed BSS-CFFMA, which leverages self-supervised embeddings. BSS-CFFMA comprises a multi-scale cross-domain feature fusion (MSCFF) block and a residual hybrid multi-attention (RHMA) block. The MSCFF block effectively integrates cross-domain features, facilitating the extraction of rich acoustic information. The RHMA block, serving as the primary enhancement module, utilizes three distinct attention modules to capture diverse attention representations and estimate high-quality speech signals.
We evaluate the performance of the BSS-CFFMA model through comparative and ablation studies on the VoiceBank-DEMAND dataset, achieving SOTA results. Furthermore, we select three types of data from the WHAMR! dataset, a collection specifically designed for speech enhancement tasks, to assess the capabilities of BSS-CFFMA in tasks such as denoising only, dereverberation only, and simultaneous denoising and dereverberation. This study marks the first attempt to explore the effectiveness of self-supervised embedding-based speech enhancement methods in complex tasks encompassing dereverberation and simultaneous denoising and dereverberation. The demo implementation of BSS-CFFMA is available online\footnote[2]{https://github.com/AlimMat/BSS-CFFMA. \label{s1}}.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
Learning Networked Dynamical System Models with Weak Form and Graph Neural Networks
Authors:
Yin Yu,
Daning Huang,
Seho Park,
Herschel C. Pangborn
Abstract:
This paper presents a sequence of two approaches for the data-driven control-oriented modeling of networked systems, i.e., the systems that involve many interacting dynamical components. First, a novel deep learning approach named the weak Latent Dynamics Model (wLDM) is developed for learning generic nonlinear dynamics with control. Leveraging the weak form, the wLDM enables more numerically stab…
▽ More
This paper presents a sequence of two approaches for the data-driven control-oriented modeling of networked systems, i.e., the systems that involve many interacting dynamical components. First, a novel deep learning approach named the weak Latent Dynamics Model (wLDM) is developed for learning generic nonlinear dynamics with control. Leveraging the weak form, the wLDM enables more numerically stable and computationally efficient training as well as more accurate prediction, when compared to conventional methods such as neural ordinary differential equations. Building upon the wLDM framework, we propose the weak Graph Koopman Bilinear Form (wGKBF) model, which integrates geometric deep learning and Koopman theory to learn latent space dynamics for networked systems, especially for the challenging cases having multiple timescales. The effectiveness of the wLDM framework and wGKBF model are demonstrated on three example systems of increasing complexity - a controlled double pendulum, the stiff Brusselator dynamics, and an electrified aircraft energy system. These numerical examples show that the wLDM and wGKBF achieve superior predictive accuracy and training efficiency as compared to baseline models. Parametric studies provide insights into the effects of hyperparameters in the weak form. The proposed framework shows the capability to efficiently capture control-dependent dynamics in these systems, including stiff dynamics and multi-physics interactions, offering a promising direction for learning control-oriented models of complex networked systems.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
PCQ: Emotion Recognition in Speech via Progressive Channel Querying
Authors:
Xincheng Wang,
Liejun Wang,
Yinfeng Yu,
Xinxin Jiao
Abstract:
In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correla-tions and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via \textbf{P}rogress…
▽ More
In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correla-tions and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via \textbf{P}rogressive \textbf{C}hannel \textbf{Q}uerying. This method can drill down layer by layer in the channel dimension through the channel query technique to achieve dynamic modeling of long-term contextual information of emotions. This mul-ti-level analysis gives the PCQ method an edge in capturing the nuances of hu-man emotions. Experimental results show that our model improves the weighted average (WA) accuracy by 3.98\% and 3.45\% and the unweighted av-erage (UA) accuracy by 5.67\% and 5.83\% on the IEMOCAP and EMODB emotion recognition datasets, respectively, significantly exceeding the baseline levels.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
UnmixingSR: Material-aware Network with Unsupervised Unmixing as Auxiliary Task for Hyperspectral Image Super-resolution
Authors:
Yang Yu
Abstract:
Deep learning-based (DL-based) hyperspectral image (HIS) super-resolution (SR) methods have achieved remarkable performance and attracted attention in industry and academia. Nonetheless, most current methods explored and learned the mapping relationship between low-resolution (LR) and high-resolution (HR) HSIs, leading to the side effect of increasing unreliability and irrationality in solving the…
▽ More
Deep learning-based (DL-based) hyperspectral image (HIS) super-resolution (SR) methods have achieved remarkable performance and attracted attention in industry and academia. Nonetheless, most current methods explored and learned the mapping relationship between low-resolution (LR) and high-resolution (HR) HSIs, leading to the side effect of increasing unreliability and irrationality in solving the ill-posed SR problem. We find, quite interestingly, LR imaging is similar to the mixed pixel phenomenon. A single photodetector in sensor arrays receives the reflectance signals reflected by a number of classes, resulting in low spatial resolution and mixed pixel problems. Inspired by this observation, this paper proposes a component-aware HSI SR network called UnmixingSR, in which the unsupervised HU as an auxiliary task is used to perceive the material components of HSIs. We regard HU as an auxiliary task and incorporate it into the HSI SR process by exploring the constraints between LR and HR abundances. Instead of only learning the mapping relationship between LR and HR HSIs, we leverage the bond between LR abundances and HR abundances to boost the stability of our method in solving SR problems. Moreover, the proposed unmixing process can be embedded into existing deep SR models as a plug-in-play auxiliary task. Experimental results on hyperspectral experiments show that unmixing process as an auxiliary task incorporated into the SR problem is feasible and rational, achieving outstanding performance. The code is available at
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Cloud-Edge-Terminal Collaborative AIGC for Autonomous Driving
Authors:
Jianan Zhang,
Zhiwei Wei,
Boxun Liu,
Xiayi Wang,
Yong Yu,
Rongqing Zhang
Abstract:
In dynamic autonomous driving environment, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIG…
▽ More
In dynamic autonomous driving environment, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIGC for autonomous driving. By delving into the unique properties of AIGC services, this article initiates the attempts to construct mutually supportive AIGC and network systems for autonomous driving, including communication, storage and computation resource allocation schemes to support AIGC services, and leveraging AIGC to assist system design and resource management.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Data on the Move: Traffic-Oriented Data Trading Platform Powered by AI Agent with Common Sense
Authors:
Yi Yu,
Shengyue Yao,
Tianchen Zhou,
Yexuan Fu,
Jingru Yu,
Ding Wang,
Xuhong Wang,
Cen Chen,
Yilun Lin
Abstract:
In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on The Move (DTM), integrating traffic simulation, data trading, an…
▽ More
In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on The Move (DTM), integrating traffic simulation, data trading, and Artificial Intelligent (AI) agents. The DTM platform supports evident-based data value evaluation and AI-based trading mechanisms. Leveraging the common sense capabilities of Large Language Models (LLMs) to assess traffic state and data value, DTM can determine reasonable traffic data pricing through multi-round interaction and simulations. Moreover, DTM provides a pricing method validation by simulating traffic systems, multi-agent interactions, and the heterogeneity and irrational behaviors of individuals in the trading market. Within the DTM platform, entities such as connected vehicles and traffic light controllers could engage in information collecting, data pricing, trading, and decision-making. Simulation results demonstrate that our proposed AI agent-based pricing approach enhances data trading by offering rational prices, as evidenced by the observed improvement in traffic efficiency. This underscores the effectiveness and practical value of DTM, offering new perspectives for the evolution of data markets and smart cities. To the best of our knowledge, this is the first study employing LLMs in data pricing and a pioneering data trading practice in the field of intelligent vehicles and smart cities.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation
Authors:
Yifeng Yu,
Jiatong Shi,
Yuning Wu,
Yuxun Tang,
Shinji Watanabe
Abstract:
Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pr…
▽ More
Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pre-trained self-supervised learning models. Building upon the existing VISinger2 framework, this study integrates additional spectral feature information into the system to enhance its performance. The integration aims to harness the rich acoustic features from the pre-trained models, thereby enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results in various corpora demonstrate the efficacy of this approach in improving the overall quality of synthesized singing voices in both objective and subjective metrics.
△ Less
Submitted 13 December, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.