Search | arXiv e-print repository

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang , et al. (32 additional authors not shown)

Abstract: We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate… ▽ More We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation △ Less

Submitted 11 March, 2025; originally announced March 2025.

Comments: https://github.com/multimodal-art-projection/YuE

arXiv:2409.10654 [pdf, other]

doi 10.1109/TBCAS.2024.3457522

An Ultra-Low Power Wearable BMI System with Continual Learning Capabilities

Authors: Lan Mei, Thorir Mar Ingolfsson, Cristian Cioflan, Victor Kartsch, Andrea Cossettini, Xiaying Wang, Luca Benini

Abstract: Driven by the progress in efficient embedded processing, there is an accelerating trend toward running machine learning models directly on wearable Brain-Machine Interfaces (BMIs) to improve portability and privacy and maximize battery life. However, achieving low latency and high classification performance remains challenging due to the inherent variability of electroencephalographic (EEG) signal… ▽ More Driven by the progress in efficient embedded processing, there is an accelerating trend toward running machine learning models directly on wearable Brain-Machine Interfaces (BMIs) to improve portability and privacy and maximize battery life. However, achieving low latency and high classification performance remains challenging due to the inherent variability of electroencephalographic (EEG) signals across sessions and the limited onboard resources. This work proposes a comprehensive BMI workflow based on a CNN-based Continual Learning (CL) framework, allowing the system to adapt to inter-session changes. The workflow is deployed on a wearable, parallel ultra-low power BMI platform (BioGAP). Our results based on two in-house datasets, Dataset A and Dataset B, show that the CL workflow improves average accuracy by up to 30.36% and 10.17%, respectively. Furthermore, when implementing the continual learning on a Parallel Ultra-Low Power (PULP) microcontroller (GAP9), it achieves an energy consumption as low as 0.45mJ per inference and an adaptation time of only 21.5ms, yielding around 25h of battery life with a small 100mAh, 3.7V battery on BioGAP. Our setup, coupled with the compact CNN model and on-device CL capabilities, meets users' needs for improved privacy, reduced latency, and enhanced inter-session performance, offering good promise for smart embedded real-world BMIs. △ Less

Submitted 16 September, 2024; originally announced September 2024.

Comments: 12 pages, 8 figures, to be published in IEEE Transactions on Biomedical Circuits and Systems (TBioCAS)

arXiv:2409.09161 [pdf, other]

Train-On-Request: An On-Device Continual Learning Workflow for Adaptive Real-World Brain Machine Interfaces

Authors: Lan Mei, Cristian Cioflan, Thorir Mar Ingolfsson, Victor Kartsch, Andrea Cossettini, Xiaying Wang, Luca Benini

Abstract: Brain-machine interfaces (BMIs) are expanding beyond clinical settings thanks to advances in hardware and algorithms. However, they still face challenges in user-friendliness and signal variability. Classification models need periodic adaptation for real-life use, making an optimal re-training strategy essential to maximize user acceptance and maintain high performance. We propose TOR, a train-on-… ▽ More Brain-machine interfaces (BMIs) are expanding beyond clinical settings thanks to advances in hardware and algorithms. However, they still face challenges in user-friendliness and signal variability. Classification models need periodic adaptation for real-life use, making an optimal re-training strategy essential to maximize user acceptance and maintain high performance. We propose TOR, a train-on-request workflow that enables user-specific model adaptation to novel conditions, addressing signal variability over time. Using continual learning, TOR preserves knowledge across sessions and mitigates inter-session variability. With TOR, users can refine, on demand, the model through on-device learning (ODL) to enhance accuracy adapting to changing conditions. We evaluate the proposed methodology on a motor-movement dataset recorded with a non-stigmatizing wearable BMI headband, achieving up to 92% accuracy and a re-calibration time as low as 1.6 minutes, a 46% reduction compared to a naive transfer learning workflow. We additionally demonstrate that TOR is suitable for ODL in extreme edge settings by deploying the training procedure on a RISC-V ultra-low-power SoC (GAP9), resulting in 21.6 ms of latency and 1 mJ of energy consumption per training step. To the best of our knowledge, this work is the first demonstration of an online, energy-efficient, dynamic adaptation of a BMI model to the intrinsic variability of EEG signals in real-time settings. △ Less

Submitted 22 September, 2024; v1 submitted 13 September, 2024; originally announced September 2024.

Comments: 5 pages, 6 figures, to be published in 2024 IEEE Biomedical Circuits and Systems Conference (BioCAS)

arXiv:2406.07161 [pdf, other]

doi 10.1109/ICCD58817.2023.00065

ACCO: Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators

Authors: Jun Yin, Linyan Mei, Andre Guntoro, Marian Verhelst

Abstract: Spatio-Temporal Convolutional Neural Networks (ST-CNN) allow extending CNN capabilities from image processing to consecutive temporal-pattern recognition. Generally, state-of-the-art (SotA) ST-CNNs inflate the feature maps and weights from well-known CNN backbones to represent the additional time dimension. However, edge computing applications would suffer tremendously from such large computation… ▽ More Spatio-Temporal Convolutional Neural Networks (ST-CNN) allow extending CNN capabilities from image processing to consecutive temporal-pattern recognition. Generally, state-of-the-art (SotA) ST-CNNs inflate the feature maps and weights from well-known CNN backbones to represent the additional time dimension. However, edge computing applications would suffer tremendously from such large computation or memory overhead. Fortunately, the overlapping nature of ST-CNN enables various optimizations, such as the dilated causal convolution structure and Depth-First (DF) layer fusion to reuse the computation between time steps and CNN sliding windows, respectively. Yet, no hardware-aware approach has been proposed that jointly explores the optimal strategy from a scheduling as well as a hardware point of view. To this end, we present ACCO, an automated optimizer that explores efficient Causal CNN transformation and DF scheduling for ST-CNNs on edge hardware accelerators. By cost-modeling the computation and data movement on the accelerator architecture, ACCO automatically selects the best scheduling strategy for the given hardware-algorithm target. Compared to the fixed dilated causal structure, ST-CNNs with ACCO reach an ~8.4x better Energy-Delay-Product. Meanwhile, ACCO improves ~20% in layer-fusion optimals compared to the SotA DF exploration toolchain. When jointly optimizing ST-CNN on the temporal and spatial dimension, ACCO's scheduling outcomes are on average 19x faster and 37x more energy-efficient than spatial DF schemes. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Journal ref: 2023 IEEE 41st International Conference on Computer Design (ICCD), Washington, DC, USA, 2023, pp. 391-398

arXiv:2309.07798 [pdf, other]

Enhancing Performance, Calibration Time and Efficiency in Brain-Machine Interfaces through Transfer Learning and Wearable EEG Technology

Authors: Xiaying Wang, Lan Mei, Victor Kartsch, Andrea Cossettini, Luca Benini

Abstract: Brain-machine interfaces (BMIs) have emerged as a transformative force in assistive technologies, empowering individuals with motor impairments by enabling device control and facilitating functional recovery. However, the persistent challenge of inter-session variability poses a significant hurdle, requiring time-consuming calibration at every new use. Compounding this issue, the low comfort level… ▽ More Brain-machine interfaces (BMIs) have emerged as a transformative force in assistive technologies, empowering individuals with motor impairments by enabling device control and facilitating functional recovery. However, the persistent challenge of inter-session variability poses a significant hurdle, requiring time-consuming calibration at every new use. Compounding this issue, the low comfort level of current devices further restricts their usage. To address these challenges, we propose a comprehensive solution that combines a tiny CNN-based Transfer Learning (TL) approach with a comfortable, wearable EEG headband. The novel wearable EEG device features soft dry electrodes placed on the headband and is capable of on-board processing. We acquire multiple sessions of motor-movement EEG data and achieve up to 96% inter-session accuracy using TL, greatly reducing the calibration time and improving usability. By executing the inference on the edge every 100ms, the system is estimated to achieve 30h of battery life. The comfortable BMI setup with tiny CNN and TL paves the way to future on-device continual learning, essential for tackling inter-session variability and improving usability. △ Less

Submitted 27 March, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

arXiv:2307.07366 [pdf, other]

Reconstructing Three-decade Global Fine-Grained Nighttime Light Observations by a New Super-Resolution Framework

Authors: Jinyu Guo, Feng Zhang, Hang Zhao, Baoxiang Pan, Linlu Mei

Abstract: Satellite-collected nighttime light provides a unique perspective on human activities, including urbanization, population growth, and epidemics. Yet, long-term and fine-grained nighttime light observations are lacking, leaving the analysis and applications of decades of light changes in urban facilities undeveloped. To fill this gap, we developed an innovative framework and used it to design a new… ▽ More Satellite-collected nighttime light provides a unique perspective on human activities, including urbanization, population growth, and epidemics. Yet, long-term and fine-grained nighttime light observations are lacking, leaving the analysis and applications of decades of light changes in urban facilities undeveloped. To fill this gap, we developed an innovative framework and used it to design a new super-resolution model that reconstructs low-resolution nighttime light data into high resolution. The validation of one billion data points shows that the correlation coefficient of our model at the global scale reaches 0.873, which is significantly higher than that of other existing models (maximum = 0.713). Our model also outperforms existing models at the national and urban scales. Furthermore, through an inspection of airports and roads, only our model's image details can reveal the historical development of these facilities. We provide the long-term and fine-grained nighttime light observations to promote research on human activities. The dataset is available at \url{https://doi.org/10.5281/zenodo.7859205}. △ Less

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2305.03061 [pdf, other]

Mining fMRI Dynamics with Parcellation Prior for Brain Disease Diagnosis

Authors: Xiaozhao Liu, Mianxin Liu, Lang Mei, Yuyao Zhang, Feng Shi, Han Zhang, Dinggang Shen

Abstract: To characterize atypical brain dynamics under diseases, prevalent studies investigate functional magnetic resonance imaging (fMRI). However, most of the existing analyses compress rich spatial-temporal information as the brain functional networks (BFNs) and directly investigate the whole-brain network without neurological priors about functional subnetworks. We thus propose a novel graph learning… ▽ More To characterize atypical brain dynamics under diseases, prevalent studies investigate functional magnetic resonance imaging (fMRI). However, most of the existing analyses compress rich spatial-temporal information as the brain functional networks (BFNs) and directly investigate the whole-brain network without neurological priors about functional subnetworks. We thus propose a novel graph learning framework to mine fMRI signals with topological priors from brain parcellation for disease diagnosis. Specifically, we 1) detect diagnosis-related temporal features using a "Transformer" for a higher-level BFN construction, and process it with a following graph convolutional network, and 2) apply an attention-based multiple instance learning strategy to emphasize the disease-affected subnetworks to further enhance the diagnosis performance and interpretability. Experiments demonstrate higher effectiveness of our method than compared methods in the diagnosis of early mild cognitive impairment. More importantly, our method is capable of localizing crucial brain subnetworks during the diagnosis, providing insights into the pathogenic source of mild cognitive impairment. △ Less

Submitted 4 May, 2023; originally announced May 2023.

Comments: 5 pages, 2 figures, conference paper, accepted by IEEE International Symposium on Biomedical Imaging (ISBI) 2023

arXiv:2211.17048 [pdf, other]

SNAF: Sparse-view CBCT Reconstruction with Neural Attenuation Fields

Authors: Yu Fang, Lanzhuju Mei, Changjian Li, Yuan Liu, Wenping Wang, Zhiming Cui, Dinggang Shen

Abstract: Cone beam computed tomography (CBCT) has been widely used in clinical practice, especially in dental clinics, while the radiation dose of X-rays when capturing has been a long concern in CBCT imaging. Several research works have been proposed to reconstruct high-quality CBCT images from sparse-view 2D projections, but the current state-of-the-arts suffer from artifacts and the lack of fine details… ▽ More Cone beam computed tomography (CBCT) has been widely used in clinical practice, especially in dental clinics, while the radiation dose of X-rays when capturing has been a long concern in CBCT imaging. Several research works have been proposed to reconstruct high-quality CBCT images from sparse-view 2D projections, but the current state-of-the-arts suffer from artifacts and the lack of fine details. In this paper, we propose SNAF for sparse-view CBCT reconstruction by learning the neural attenuation fields, where we have invented a novel view augmentation strategy to overcome the challenges introduced by insufficient data from sparse input views. Our approach achieves superior performance in terms of high reconstruction quality (30+ PSNR) with only 20 input views (25 times fewer than clinical collections), which outperforms the state-of-the-arts. We have further conducted comprehensive experiments and ablation analysis to validate the effectiveness of our approach. △ Less

Submitted 30 November, 2022; originally announced November 2022.

arXiv:2106.12864 [pdf, other]

A Systematic Collection of Medical Image Datasets for Deep Learning

Authors: Johann Li, Guangming Zhu, Cong Hua, Mingtao Feng, BasheerBennamoun, Ping Li, Xiaoyuan Lu, Juan Song, Peiyi Shen, Xu Xu, Lin Mei, Liang Zhang, Syed Afaq Ali Shah, Mohammed Bennamoun

Abstract: The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analy… ▽ More The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analysis. Medical image acquisition, annotation, and analysis are costly, and their usage is constrained by ethical restrictions. They also require many resources, such as human expertise and funding. That makes it difficult for non-medical researchers to have access to useful and large medical data. Thus, as comprehensive as possible, this paper provides a collection of medical image datasets with their associated challenges for deep learning research. We have collected information of around three hundred datasets and challenges mainly reported between 2013 and 2020 and categorized them into four categories: head & neck, chest & abdomen, pathology & blood, and ``others''. Our paper has three purposes: 1) to provide a most up to date and complete list that can be used as a universal reference to easily find the datasets for clinical image analysis, 2) to guide researchers on the methodology to test and evaluate their methods' performance and robustness on relevant datasets, 3) to provide a ``route'' to relevant algorithms for the relevant medical topics, and challenge leaderboards. △ Less

Submitted 24 June, 2021; originally announced June 2021.

Comments: This paper has been submitted to one journal

arXiv:2104.12959 [pdf, other]

Realtime Mobile Bandwidth and Handoff Predictions in 4G/5G Networks

Authors: Lifan Mei, Jinrui Gou, Yujin Cai, Houwei Cao, Yong Liu

Abstract: Mobile apps are increasingly relying on high-throughput and low-latency content delivery, while the available bandwidth on wireless access links is inherently time-varying. The handoffs between base stations and access modes due to user mobility present additional challenges to deliver a high level of user Quality-of-Experience (QoE). The ability to predict the available bandwidth and the upcoming… ▽ More Mobile apps are increasingly relying on high-throughput and low-latency content delivery, while the available bandwidth on wireless access links is inherently time-varying. The handoffs between base stations and access modes due to user mobility present additional challenges to deliver a high level of user Quality-of-Experience (QoE). The ability to predict the available bandwidth and the upcoming handoffs will give applications valuable leeway to make proactive adjustments to avoid significant QoE degradation. In this paper, we explore the possibility and accuracy of realtime mobile bandwidth and handoff predictions in 4G/LTE and 5G networks. Towards this goal, we collect long consecutive traces with rich bandwidth, channel, and context information from public transportation systems. We develop Recurrent Neural Network models to mine the temporal patterns of bandwidth evolution in fixed-route mobility scenarios. Our models consistently outperform the conventional univariate and multivariate bandwidth prediction models. For 4G \& 5G co-existing networks, we propose a new problem of handoff prediction between 4G and 5G, which is important for low-latency applications like self-driving strategy in realistic 5G scenarios. We develop classification and regression based prediction models, which achieve more than 80\% accuracy in predicting 4G and 5G handoffs in a recent 5G dataset. △ Less

Submitted 26 April, 2021; originally announced April 2021.

Comments: 12 pages

arXiv:2005.07343 [pdf, other]

Visual Perception Model for Rapid and Adaptive Low-light Image Enhancement

Authors: Xiaoxiao Li, Xiaopeng Guo, Liye Mei, Mingyu Shang, Jie Gao, Maojing Shu, Xiang Wang

Abstract: Low-light image enhancement is a promising solution to tackle the problem of insufficient sensitivity of human vision system (HVS) to perceive information in low light environments. Previous Retinex-based works always accomplish enhancement task by estimating light intensity. Unfortunately, single light intensity modelling is hard to accurately simulate visual perception information, leading to th… ▽ More Low-light image enhancement is a promising solution to tackle the problem of insufficient sensitivity of human vision system (HVS) to perceive information in low light environments. Previous Retinex-based works always accomplish enhancement task by estimating light intensity. Unfortunately, single light intensity modelling is hard to accurately simulate visual perception information, leading to the problems of imbalanced visual photosensitivity and weak adaptivity. To solve these problems, we explore the precise relationship between light source and visual perception and then propose the visual perception (VP) model to acquire a precise mathematical description of visual perception. The core of VP model is to decompose the light source into light intensity and light spatial distribution to describe the perception process of HVS, offering refinement estimation of illumination and reflectance. To reduce complexity of the estimation process, we introduce the rapid and adaptive $\mathbfβ$ and $\mathbfγ$ functions to build an illumination and reflectance estimation scheme. Finally, we present a optimal determination strategy, consisting of a \emph{cycle operation} and a \emph{comparator}. Specifically, the \emph{comparator} is responsible for determining the optimal enhancement results from multiple enhanced results through implementing the \emph{cycle operation}. By coordinating the proposed VP model, illumination and reflectance estimation scheme, and the optimal determination strategy, we propose a rapid and adaptive framework for low-light image enhancement. Extensive experiment results demenstrate that the proposed method achieves better performance in terms of visual comparison, quantitative assessment, and computational efficiency, compared with the currently state-of-the-arts. △ Less

Submitted 14 May, 2020; originally announced May 2020.

Comments: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

Showing 1–11 of 11 results for author: Mei, L