-
Aligned Better, Listen Better for Audio-Visual Large Language Models
Authors:
Yuxin Guo,
Shuailei Ma,
Shijie Ma,
Xiaoyi Bao,
Chen-Wei Xie,
Kecheng Zheng,
Tingyu Weng,
Siyang Sun,
Yun Zheng,
Wei Zou
Abstract:
Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. On the other hand, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we delve into the model architecture and dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.
Submitted 2 April, 2025;
originally announced April 2025.
-
Easing Seasickness through Attention Redirection with a Mindfulness-Based Brain--Computer Interface
Authors:
Xiaoyu Bao,
Kailin Xu,
Jiawei Zhu,
Haiyun Huang,
Kangning Li,
Qiyun Huang,
Yuanqing Li
Abstract:
Seasickness is a prevalent issue that adversely impacts both passenger experiences and the operational efficiency of maritime crews. While techniques that redirect attention have proven effective in alleviating motion sickness symptoms in terrestrial environments, applying similar strategies to manage seasickness poses unique challenges due to the prolonged and intense motion environment associated with maritime travel. In this study, we propose a mindfulness brain-computer interface (BCI), specifically designed to redirect attention with the aim of mitigating seasickness symptoms in real-world settings. Our system utilizes a single-channel headband to capture prefrontal EEG signals, which are then wirelessly transmitted to computing devices for the assessment of mindfulness states. The results are converted into real-time feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in attentional focus from physiological discomfort to mindfulness practices. A total of 43 individuals participated in a real-world maritime experiment consisting of three sessions: a real-feedback mindfulness session, a resting session, and a pseudofeedback mindfulness session. Notably, 81.39% of participants reported that the mindfulness BCI intervention was effective, and there was a significant reduction in the severity of seasickness, as measured by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in the theta/beta ratio, corresponding with the alleviation of seasickness symptoms. A decrease in overall EEG band power during the real-feedback mindfulness session suggests that the mindfulness BCI fosters a more tranquil and downregulated state of brain activity. Together, this study presents a novel nonpharmacological, portable, and effective approach for seasickness intervention, with the potential to enhance the cruising experience for both passengers and crews.
Submitted 14 January, 2025;
originally announced January 2025.
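The theta/beta ratio mentioned in the seasickness abstract above is a ratio of EEG band powers. The following is a minimal sketch of how such a ratio could be computed for a single channel. It is not the authors' pipeline: the band edges (theta 4-8 Hz, beta 13-30 Hz), the 128 Hz sampling rate, the synthetic test signal, and the helper names `band_power` and `theta_beta_ratio` are all assumptions made for illustration.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Sum of squared DFT magnitudes over bins with f_lo <= f <= f_hi (naive DFT)."""
    n = len(signal)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(signal))
            power += abs(coeff) ** 2
    return power


def theta_beta_ratio(signal, fs, theta=(4.0, 8.0), beta=(13.0, 30.0)):
    """Ratio of theta-band power to beta-band power (band edges are assumed)."""
    return band_power(signal, fs, *theta) / band_power(signal, fs, *beta)


# Synthetic two-second "EEG" trace: a 6 Hz (theta) tone with amplitude 2
# plus a 20 Hz (beta) tone with amplitude 1. Purely illustrative data.
FS = 128
N = 256
demo = [2.0 * math.sin(2 * math.pi * 6 * i / FS)
        + 1.0 * math.sin(2 * math.pi * 20 * i / FS)
        for i in range(N)]

ratio = theta_beta_ratio(demo, FS)
print(f"theta/beta ratio: {ratio:.3f}")
```

Both tones complete a whole number of cycles in the window, so the ratio reduces to the squared amplitude ratio. On real recordings a windowed estimator such as Welch's method would normally replace the naive DFT.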
-
Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval with Semantic Guidance
Authors:
Xuchan Bao,
Judith Yue Li,
Zhong Yi Wan,
Kun Su,
Timo Denk,
Joonseok Lee,
Dima Kuzmin,
Fei Sha
Abstract:
Modern music retrieval systems often rely on fixed representations of user preferences, limiting their ability to capture users' diverse and uncertain retrieval needs. To address this limitation, we introduce Diff4Steer, a novel generative retrieval framework that employs lightweight diffusion models to synthesize diverse seed embeddings from user queries that represent potential directions for music exploration. Unlike deterministic methods that map a user query to a single point in embedding space, Diff4Steer provides a statistical prior on the target modality (audio) for retrieval, effectively capturing the uncertainty and multi-faceted nature of user preferences. Furthermore, Diff4Steer can be steered by image or text inputs, enabling more flexible and controllable music discovery combined with nearest neighbor search. Our framework outperforms deterministic regression methods and an LLM-based generative retrieval baseline in terms of retrieval and ranking metrics, demonstrating its effectiveness in capturing user preferences and leading to more diverse and relevant recommendations. Listening examples are available at tinyurl.com/diff4steer.
Submitted 5 December, 2024;
originally announced December 2024.
-
Statistical AoI Guarantee Optimization for Supporting xURLLC in ISAC-enabled V2I Networks
Authors:
Yanxi Zhang,
Mingwu Yao,
Qinghai Yang,
Dongqi Yan,
Xu Zhang,
Xu Bao,
Muyu Mei
Abstract:
This paper addresses the critical challenge of supporting next-generation ultra-reliable and low-latency communication (xURLLC) within integrated sensing and communication (ISAC)-enabled vehicle-to-infrastructure (V2I) networks. We incorporate channel evaluation and retransmission mechanisms for real-time reliability enhancement. Using stochastic network calculus (SNC), we establish a theoretical framework to derive upper bounds for the peak age of information violation probability (PAVP) via characterized sensing and communication moment generating functions (MGFs). By optimizing these bounds, we develop power allocation schemes that significantly reduce the statistical PAVP of sensory packets in such networks. Simulations validate our theoretical derivations and demonstrate the effectiveness of our proposed schemes.
Submitted 1 August, 2024;
originally announced August 2024.
-
SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation
Authors:
Qing Xu,
Wenwei Kuang,
Zeyu Zhang,
Xueyao Bao,
Haoran Chen,
Wenting Duan
Abstract:
Image segmentation plays an essential role in nuclei image analysis. Recently, the segment anything model has made a significant breakthrough in such tasks. However, the current model has two major issues for cell segmentation: (1) The image encoder of the segment anything model involves a large number of parameters. Retraining or even fine-tuning the model still requires expensive computational resources. (2) In point prompt mode, points are sampled from the center of the ground truth and more than one set of points is expected to achieve reliable performance, which is not efficient for practical applications. In this paper, a single-point prompt network is proposed for nuclei image segmentation, called SPPNet. We replace the original image encoder with a lightweight vision transformer. Also, an effective convolutional block is added in parallel to extract the low-level semantic information from the image and compensate for the performance degradation due to the small image encoder. We propose a new point-sampling method based on the Gaussian kernel. The proposed model is evaluated on the MoNuSeg-2018 dataset. The results demonstrate that SPPNet outperforms existing U-shape architectures and shows faster convergence in training. Compared to the segment anything model, SPPNet shows roughly 20 times faster inference, with 1/70 of the parameters and computational cost. Notably, only one set of points is required in both the training and inference phases, which is more reasonable for clinical applications. The code for our work and more technical details can be found at https://github.com/xq141839/SPPNet.
Submitted 23 August, 2023;
originally announced August 2023.
-
Improving COVID-19 CT Classification of CNNs by Learning Parameter-Efficient Representation
Authors:
Yujia Xu,
Hak-Keung Lam,
Guangyu Jia,
Jian Jiang,
Junkai Liao,
Xinqi Bao
Abstract:
The COVID-19 pandemic continues to spread rapidly over the world and causes a tremendous crisis in global human health and the economy. Its early detection and diagnosis are crucial for controlling the further spread. Many deep learning-based methods have been proposed to assist clinicians in automatic COVID-19 diagnosis based on computed tomography imaging. However, challenges still remain, including low data diversity in existing datasets and unsatisfactory detection resulting from insufficient accuracy and sensitivity of deep learning models. To enhance the data diversity, we design augmentation techniques of incremental levels and apply them to the largest open-access benchmark dataset, COVIDx CT-2A. Meanwhile, similarity regularization (SR) derived from contrastive learning is proposed in this study to enable CNNs to learn more parameter-efficient representations, thus improving the accuracy and sensitivity of CNNs. The results on seven commonly used CNNs demonstrate that CNN performance can be improved stably by applying the designed augmentation and SR techniques. In particular, DenseNet121 with SR achieves an average test accuracy of 99.44% in three trials for three-category classification, including normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. The achieved precision, sensitivity, and specificity for the COVID-19 pneumonia category are 98.40%, 99.59%, and 99.50%, respectively. These statistics suggest that our method has surpassed the existing state-of-the-art methods on the COVIDx CT-2A dataset.
Submitted 9 August, 2022;
originally announced August 2022.
-
Time-Frequency Distributions of Heart Sound Signals: A Comparative Study using Convolutional Neural Networks
Authors:
Xinqi Bao,
Yujia Xu,
Hak-Keung Lam,
Mohamed Trabelsi,
Ines Chihi,
Lilia Sidhom,
Ernest N. Kamavuako
Abstract:
Time-Frequency Distributions (TFDs) support heart sound characterisation and classification in early cardiac screening. However, despite the frequent use of TFDs in signal analysis, no study has comprehensively compared their performance in deep learning for automatic diagnosis. Furthermore, the combination of signal processing methods as inputs for Convolutional Neural Networks (CNNs) has proved to be a practical approach to increasing signal classification performance. Therefore, this study aimed to investigate the optimal use of TFDs or combined TFDs as input for CNNs. The presented results revealed that: 1) The transformation of the heart sound signal into the TF domain achieves higher classification performance than using raw signals. Among the TFDs, the difference in the performance was slight for all the CNN models (within $1.3\%$ in average accuracy). However, Continuous wavelet transform (CWT) and Chirplet transform (CT) outperformed the rest. 2) An appropriate increase in CNN capacity and architecture optimisation can improve the performance, while the network architecture should not be overly complicated. Based on the ResNet and SEResNet family results, increasing the number of parameters and the depth of the structure does not noticeably improve the performance. 3) Combining TFDs as CNN inputs did not significantly improve the classification results. The findings of this study provide guidance for selecting TFDs as CNN input and designing CNN architectures for heart sound classification.
Submitted 5 August, 2022;
originally announced August 2022.
-
Levenberg-Marquardt Method Based Cooperative Source Localization in SIMO Molecular Communication via Diffusion Systems
Authors:
Yuqi Miao,
Wence Zhang,
Xu Bao
Abstract:
Molecular communication underpins nano-scale communications in nanotechnology. The combination of multiple nanomachines to form nano-networks is one of the main enabling methods. Due to the importance of source localization in establishing nano-networks, this paper proposes a cooperative source localization method for Molecular Communication via Diffusion (MCvD) systems using multiple spherical absorption receivers. Since there is no exact mathematical expression of the channel impulse response for multiple absorbing receivers, we adopt an empirical expression and use the Levenberg-Marquardt method to estimate the distance of the transmitter to each receiver, based on which the location of the transmitter is obtained using an iterative scheme where the initial point is obtained using a multi-point localization method. Particle-based simulation is carried out to evaluate the performance of the proposed method. Simulation results show that the proposed method can accurately estimate the location of the transmitter in short to medium communication ranges.
Submitted 16 March, 2022;
originally announced March 2022.
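The distance-estimation step described in the molecular communication abstract above can be illustrated with a one-parameter Levenberg-Marquardt fit. This is a sketch under stated assumptions, not the paper's method: it uses the well-known closed-form absorbed fraction for a single spherical absorbing receiver as a stand-in for the paper's empirical multi-receiver expression, which is not given in the abstract. The diffusion coefficient, receiver radius, sample times, and the names `absorbed_fraction` and `lm_estimate_distance` are hypothetical.

```python
import math

D = 79.4     # diffusion coefficient in um^2/s (assumed value)
R_RX = 5.0   # receiver radius in um (assumed value)


def absorbed_fraction(t, d):
    """Expected fraction of molecules absorbed by time t for a point source at
    distance d from the surface of ONE spherical absorbing receiver.
    Stand-in model; the paper uses an empirical multi-receiver expression."""
    return (R_RX / (R_RX + d)) * math.erfc(d / math.sqrt(4.0 * D * t))


def lm_estimate_distance(times, observed, d0, iters=100):
    """Fit d by Levenberg-Marquardt with a numerical Jacobian."""
    def residuals(d):
        return [o - absorbed_fraction(t, d) for t, o in zip(times, observed)]

    d, lam, h = d0, 1e-2, 1e-6
    r = residuals(d)
    cost = sum(x * x for x in r)
    for _ in range(iters):
        # Central-difference Jacobian of the model with respect to d.
        jac = [(absorbed_fraction(t, d + h) - absorbed_fraction(t, d - h)) / (2 * h)
               for t in times]
        grad = sum(j * x for j, x in zip(jac, r))   # J^T r
        hess = sum(j * j for j in jac)              # J^T J (a scalar here)
        step = grad / (hess * (1.0 + lam))          # damped Gauss-Newton step
        d_new = d + step
        if d_new > 0:
            r_new = residuals(d_new)
            cost_new = sum(x * x for x in r_new)
            if cost_new < cost:
                # Accept the step and trust the linearization more.
                d, r, cost = d_new, r_new, cost_new
                lam /= 10.0
                continue
        # Reject the step (no improvement or non-physical d): damp harder.
        lam *= 10.0
    return d


# Noise-free synthetic observations generated with a true distance of 4 um.
times = [0.05 * k for k in range(1, 21)]
observed = [absorbed_fraction(t, 4.0) for t in times]
d_hat = lm_estimate_distance(times, observed, d0=10.0)
print(f"estimated distance: {d_hat:.4f} um")
```

In a cooperative setting, one such estimate per receiver would feed a separate localization step. That step is not sketched here because the abstract does not specify it beyond "an iterative scheme".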
-
W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image
Authors:
Anyu Mao,
Jialun Wu,
Xinrui Bao,
Zeyu Gao,
Tieliang Gong,
Chen Li
Abstract:
Pathological diagnosis is the gold standard for cancer diagnosis, but it is labor-intensive, in which tasks such as cell detection, classification, and counting are particularly prominent. A common solution for automating these tasks is using nucleus segmentation technology. However, it is hard to train a robust nucleus segmentation model, due to several challenging problems: nucleus adhesion, stacking, and excessive fusion with the background. Recently, some researchers proposed a series of automatic nucleus segmentation methods based on point annotation, which can significantly improve the model performance. Nevertheless, the point annotation needs to be marked by experienced pathologists. In order to take advantage of segmentation methods based on point annotation, further alleviate the manual workload, and make cancer diagnosis more efficient and accurate, it is necessary to develop an automatic nucleus detection algorithm, which can automatically and efficiently locate the position of the nucleus in the pathological image and extract valuable information for pathologists. In this paper, we propose a W-shaped network for automatic nucleus detection. Unlike the traditional U-Net-based method, which maps the original pathology image to the target mask directly, our proposed method splits the detection task into two sub-tasks. The first sub-task maps the original pathology image to the binary mask, then the binary mask is mapped to the density mask in the second sub-task. After the task is split, the task's difficulty is significantly reduced, and the network's overall performance is improved.
Submitted 26 October, 2021;
originally announced October 2021.
-
A Precision Diagnostic Framework of Renal Cell Carcinoma on Whole-Slide Images using Deep Learning
Authors:
Jialun Wu,
Haichuan Zhang,
Zeyu Gao,
Xinrui Bao,
Tieliang Gong,
Chunbao Wang,
Chen Li
Abstract:
Diagnostic pathology, which is the basis and gold standard of cancer diagnosis, provides essential information on the prognosis of the disease and vital evidence for clinical treatment. Tumor region detection, subtype and grade classification are the fundamental diagnostic indicators for renal cell carcinoma (RCC) in whole-slide images (WSIs). However, pathological diagnosis is subjective, and differences in observation and diagnosis between pathologists are common in hospitals with inadequate diagnostic capacity. The main challenge for developing a deep learning-based RCC diagnostic system is the lack of large-scale datasets with precise annotations. In this work, we propose a deep learning-based framework for analyzing histopathological images of patients with renal cell carcinoma, which has the potential to achieve pathologist-level accuracy in diagnosis. A deep convolutional neural network (InceptionV3) was trained on the high-quality annotated dataset of The Cancer Genome Atlas (TCGA) whole-slide histopathological images for accurate tumor area detection, classification of RCC subtypes, and ISUP grade classification of clear cell carcinoma subtypes. These results suggest that our framework can help pathologists in the detection of cancer regions and classification of subtypes and grades, which could be applied to any cancer type, providing auxiliary diagnosis and promoting clinical consensus.
Submitted 26 October, 2021;
originally announced October 2021.
-
Outage Analysis for Intelligent Reflecting Surface Assisted Vehicular Communication Networks
Authors:
Jue Wang,
Wence Zhang,
Xu Bao,
Tiecheng Song,
Cunhua Pan
Abstract:
Vehicular communication is an important application of the fifth generation of mobile communication systems (5G). Due to its low cost and energy efficiency, the intelligent reflecting surface (IRS) has been envisioned as a promising technique that can enhance the coverage performance significantly by passive beamforming. In this paper, we analyze the outage probability performance in IRS-assisted vehicular communication networks. We derive the expression of outage probability by utilizing series expansion and the central limit theorem. Numerical results show that the IRS can significantly reduce the outage probability for vehicles in its vicinity. The outage probability is closely related to the vehicle density and the number of IRS elements, and better performance is achieved with more reflecting elements.
Submitted 20 April, 2020; v1 submitted 17 April, 2020;
originally announced April 2020.
-
Computational distributed fiber-optic sensing
Authors:
Da-Peng Zhou,
Wei Peng,
Liang Chen,
Xiaoyi Bao
Abstract:
Ghost imaging allows image reconstruction by correlation measurements between a light beam that interacts with the object without spatial resolution and a spatially resolved light beam that never interacts with the object. The two light beams are copies of each other. Its computational version removes the requirement of a spatially resolved detector when the light intensity pattern is pre-known. Here, we exploit the temporal analogue of computational ghost imaging, and demonstrate a computational distributed fiber-optic sensing technique. Temporal images containing spatially distributed scattering information used for sensing purposes are retrieved through correlating the "integrated" backscattered light and the pre-known binary patterns. The sampling rate required for our technique is inversely proportional to the total time duration of a binary sequence, so that it can be significantly reduced compared to that of the traditional methods. Our experiments demonstrate a 3 orders of magnitude reduction in the sampling rate, offering great simplification and cost reduction in the distributed fiber-optic sensors.
Submitted 14 April, 2019;
originally announced April 2019.
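The correlation step described in the fiber-sensing abstract above (correlating the "integrated" backscattered light with pre-known binary patterns) follows the usual computational ghost imaging recipe. The toy sketch below shows that recipe on synthetic data only. It does not model fiber physics or the authors' experiment: the 16-sample scattering profile, the 2000 random patterns, and the function name `ghost_reconstruct` are all hypothetical.

```python
import random


def ghost_reconstruct(patterns, buckets):
    """Correlation estimate G[k] = <B * P[k]> - <B> * <P[k]> over all patterns."""
    m = len(patterns)
    n = len(patterns[0])
    mean_b = sum(buckets) / m
    image = []
    for k in range(n):
        mean_p = sum(p[k] for p in patterns) / m
        mean_bp = sum(b * p[k] for b, p in zip(buckets, patterns)) / m
        image.append(mean_bp - mean_b * mean_p)
    return image


rng = random.Random(0)  # fixed seed so the demo is repeatable

# Hypothetical scattering profile along the fiber: two reflective points.
profile = [0.0] * 16
profile[5] = 1.0
profile[11] = 0.5

# Pre-known random binary patterns, one per measurement.
patterns = [[rng.randint(0, 1) for _ in range(16)] for _ in range(2000)]

# "Integrated" detection: a single number per pattern, with no time resolution.
buckets = [sum(p * s for p, s in zip(pattern, profile)) for pattern in patterns]

recovered = ghost_reconstruct(patterns, buckets)
strongest = recovered.index(max(recovered))
print(f"strongest recovered scatterer at index {strongest}")
```

Each measurement is a single integrated value, so the detector needs only one sample per pattern. This is the intuition behind the reduced sampling rate the abstract reports.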
-
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
Authors:
Sicong Huang,
Qiyang Li,
Cem Anil,
Xuchan Bao,
Sageev Oore,
Roger B. Grosse
Abstract:
In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.
Submitted 22 October, 2023; v1 submitted 22 November, 2018;
originally announced November 2018.
-
Deep Neural Networks for Improved, Impromptu Trajectory Tracking of Quadrotors
Authors:
Qiyang Li,
Jingxing Qian,
Zining Zhu,
Xuchan Bao,
Mohamed K. Helwa,
Angela P. Schoellig
Abstract:
Trajectory tracking control for quadrotors is important for applications ranging from surveying and inspection to film making. However, designing and tuning classical controllers, such as proportional-integral-derivative (PID) controllers, to achieve high tracking precision can be time-consuming and difficult, due to hidden dynamics and other non-idealities. The Deep Neural Network (DNN), with its superior capability of approximating abstract, nonlinear functions, offers a novel approach for enhancing trajectory tracking control. This paper presents a DNN-based algorithm as an add-on module that improves the tracking performance of a classical feedback controller. Given a desired trajectory, the DNNs provide a tailored reference input to the controller based on their gained experience. The input aims to achieve a unity map between the desired and the output trajectory. The motivation for this work is an interactive "fly-as-you-draw" application, in which a user draws a trajectory on a mobile device, and a quadrotor instantly flies that trajectory with the DNN-enhanced control system. Experimental results demonstrate that the proposed approach improves the tracking precision for user-drawn trajectories after the DNNs are trained on selected periodic trajectories, suggesting the method's potential in real-world applications. Tracking errors are reduced by around 40-50% for both training and testing trajectories from users, highlighting the DNNs' capability of generalizing knowledge.
Submitted 19 July, 2017; v1 submitted 20 October, 2016;
originally announced October 2016.