-
Aerial Multi-View Stereo via Adaptive Depth Range Inference and Normal Cues
Authors:
Yimei Liu,
Yakun Ju,
Yuan Rao,
Hao Fan,
Junyu Dong,
Feng Gao,
Qian Du
Abstract:
Three-dimensional digital urban reconstruction from multi-view aerial images is a critical application where deep multi-view stereo (MVS) methods outperform traditional techniques. However, existing methods commonly overlook the key differences between aerial and close-range settings, such as varying depth ranges along epipolar lines and insensitive feature matching associated with low-detail aerial images. To address these issues, we propose an Adaptive Depth Range MVS (ADR-MVS), which integrates monocular geometric cues to improve multi-view depth estimation accuracy. The key component of ADR-MVS is the depth range predictor, which generates adaptive range maps from depth and normal estimates using cross-attention discrepancy learning. In the first stage, the range map derived from monocular cues breaks through predefined depth boundaries, improving feature-matching discriminability and mitigating convergence to local optima. In later stages, the inferred range maps are progressively narrowed, ultimately aligning with the cascaded MVS framework for precise depth regression. Moreover, a normal-guided cost aggregation operation is specially devised for aerial stereo images to improve geometric awareness within the cost volume. Finally, we introduce a normal-guided depth refinement module that surpasses existing RGB-guided techniques. Experimental results demonstrate that ADR-MVS achieves state-of-the-art performance on the WHU, LuoJia-MVS, and München datasets, while exhibiting superior computational complexity.
Submitted 5 June, 2025;
originally announced June 2025.
-
Ola: Pushing the Frontiers of Omni-Modal Language Model
Authors:
Zuyan Liu,
Yuhao Dong,
Jiahui Wang,
Ziwei Liu,
Winston Hu,
Jiwen Lu,
Yongming Rao
Abstract:
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, pushing the frontiers of the omni-modal language model to a large extent. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
Submitted 2 June, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
UMFA: A photorealistic style transfer method based on U-Net and multi-layer feature aggregation
Authors:
D. Y. Rao,
X. J. Wu,
H. Li,
J. Kittler,
T. Y. Xu
Abstract:
In this paper, we propose a photorealistic style transfer network that emphasizes the natural appearance of photorealistic image stylization. In general, distortion of the image content and lack of detail are two typical issues in the style transfer field. To address them, we design a novel framework employing the U-Net structure to maintain rich spatial clues, with a multi-layer feature aggregation (MFA) method that simultaneously provides the details obtained by the shallow layers during stylization. In particular, an encoder based on the dense block and a decoder, which together form a symmetrical U-Net structure, are jointly stacked to realize effective feature extraction and image reconstruction. In addition, a transfer module based on MFA and "adaptive instance normalization" (AdaIN) is inserted at the skip connection positions to achieve the stylization. Accordingly, the stylized image possesses the texture of a real photo and preserves rich content details without introducing any mask or post-processing steps. The experimental results on public datasets demonstrate that our method achieves a more faithful structural similarity with a lower style loss, reflecting the effectiveness and merit of our approach.
Submitted 13 August, 2021;
originally announced August 2021.
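The UMFA abstract above names "adaptive instance normalization" (AdaIN) as part of its transfer module. As a rough illustration of that one operation only, the sketch below aligns the per-channel mean and standard deviation of content features to those of style features, following the standard AdaIN definition. The function name `adain`, the flat-list feature layout, and the `eps` value are assumptions made for this example; this is not the authors' implementation, and the MFA part of the module is not modeled.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Standard AdaIN on toy features (hypothetical layout).

    content, style: lists of channels, each channel a flat list of floats.
    For every channel, the content values are normalised to zero mean and
    unit standard deviation, then rescaled to the style channel's statistics.
    """
    out = []
    for c_ch, s_ch in zip(content, style):
        c_mu, c_sigma = fmean(c_ch), pstdev(c_ch)
        s_mu, s_sigma = fmean(s_ch), pstdev(s_ch)
        # eps guards against division by zero for constant channels.
        out.append([s_sigma * (v - c_mu) / (c_sigma + eps) + s_mu for v in c_ch])
    return out


# Toy usage: one channel of content features and one channel of style features.
content = [[1.0, 2.0, 3.0, 4.0]]
style = [[10.0, 10.0, 30.0, 30.0]]
stylized = adain(content, style)
```

After the call, the single channel in `stylized` keeps the ordering of the content values but carries (up to the small `eps` term) the mean and standard deviation of the style channel.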
-
NightVision: Generating Nighttime Satellite Imagery from Infra-Red Observations
Authors:
Paula Harder,
William Jones,
Redouane Lguensat,
Shahine Bouabid,
James Fulton,
Dánell Quesada-Chacón,
Aris Marcolongo,
Sofija Stefanović,
Yuhan Rao,
Peter Manshausen,
Duncan Watson-Parris
Abstract:
The recent explosion in applications of machine learning to satellite imagery often relies on visible images and therefore suffers from a lack of data during the night. This gap can be filled by employing available infra-red observations to generate visible images. This work presents how deep learning can be applied successfully to create those images by using U-Net based architectures. The proposed methods show promising results, achieving a structural similarity index (SSIM) of up to 86% on an independent test set and providing visually convincing output images generated from infra-red observations.
Submitted 8 December, 2020; v1 submitted 13 November, 2020;
originally announced November 2020.
-
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders
Authors:
Yu Gu,
Xiang Yin,
Yonghui Rao,
Yuan Wan,
Benlai Tang,
Yang Zhang,
Jitong Chen,
Yuxuan Wang,
Zejun Ma
Abstract:
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders. Different from conventional SVS models, the proposed ByteSing employs Tacotron-like encoder-decoder structures as the acoustic models, in which CBHG models and recurrent neural networks (RNNs) are explored as encoders and decoders, respectively. Meanwhile, an auxiliary phoneme duration prediction model is utilized to expand the input sequence, which can enhance the model's controllability, stability, and tempo prediction accuracy. WaveRNN neural vocoders are also adopted to further improve the voice quality of synthesized songs. Both objective and subjective experimental results show that the proposed SVS method can produce quite natural, expressive, and high-fidelity songs by improving the pitch and spectrogram prediction accuracy, and that the models using an attention mechanism achieve the best performance.
Submitted 24 January, 2021; v1 submitted 23 April, 2020;
originally announced April 2020.
-
Structure-Preserving Super Resolution with Gradient Guidance
Authors:
Cheng Ma,
Yongming Rao,
Yean Cheng,
Ce Chen,
Jiwen Lu,
Jie Zhou
Abstract:
Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial networks (GANs) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super resolution method to alleviate this issue while maintaining the merits of GAN-based methods in generating perceptually pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic and can potentially be used with off-the-shelf SR networks. Experimental results show that we achieve the best PI and LPIPS performance, with comparable PSNR and SSIM, relative to state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.
Submitted 29 March, 2020;
originally announced March 2020.
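The super-resolution abstract above relies on gradient maps of images as structural guidance. A minimal sketch of what such a map can look like is given below: the per-pixel magnitude of central finite differences on a grayscale image. The function name `gradient_map`, the list-of-rows image layout, and the edge-replicating border handling are assumptions for illustration; the exact operator used in the paper is not stated in the abstract.

```python
import math


def gradient_map(img):
    """Gradient-magnitude map of a 2D grayscale image given as a list of rows.

    Uses central differences in x and y. Out-of-range neighbours are clamped
    to the nearest valid pixel (an assumption made for this sketch), so a
    constant image yields an all-zero map.
    """
    h, w = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates to the image bounds (edge replication).
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [
            math.hypot(px(y, x + 1) - px(y, x - 1), px(y + 1, x) - px(y - 1, x))
            for x in range(w)
        ]
        for y in range(h)
    ]


# Toy usage: a vertical step edge between the second and third columns.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = gradient_map(step)
```

In this toy case the map responds only in the two columns adjacent to the step, which is the kind of structure-only signal a gradient-space loss can compare between a super-resolved image and its ground truth.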
-
Deep Fault Diagnosis for Rotating Machinery with Scarce Labeled Samples
Authors:
Jing Zhang,
Jing Tian,
Tao Wen,
Xiaohui Yang,
Yong Rao,
Xiaobin Xu
Abstract:
Early and accurate detection of faults in rotating machinery is crucial for the operational safety of modern manufacturing systems. In this paper, we propose a novel deep fault diagnosis (DFD) method for rotating machinery with scarce labeled samples. DFD tackles this challenging problem by transferring knowledge from shallow models, based on the idea that shallow models trained with different hand-crafted features can reveal latent prior knowledge and diagnostic expertise and generalize well even with scarce labeled samples. DFD can be divided into three phases. First, a spectrogram of the raw vibration signal is calculated by applying a short-time Fourier transform (STFT). From those spectrograms, discriminative time-frequency domain features are extracted and used to form a feature pool. Then, several candidate support vector machine (SVM) models are trained on the scarce labeled samples with different combinations of features from the feature pool. By evaluating the pretrained SVM models on the validation set, the most discriminative features and best-performing SVM models are selected and used to make predictions on the unlabeled samples. The predicted labels preserve the expert knowledge originally carried by the SVM models. They are combined with the scarce finely labeled samples to form an augmented training set (ATS). Finally, a novel 2D deep convolutional neural network (CNN) model is trained on the ATS to learn more discriminative features and a better classifier. Experimental results on two fault diagnosis datasets demonstrate the effectiveness of the proposed DFD, which achieves better performance than SVM models and the vanilla deep CNN model trained on scarce labeled samples. Moreover, it is computationally efficient and is promising for real-time rotating machinery fault diagnosis.
Submitted 13 July, 2019;
originally announced July 2019.
-
Rayleigh fading suppression in one-dimension optical scatters
Authors:
Shengtao Lin,
Zinan Wang,
Ji Xiong,
Yun Fu,
Jialin Jiang,
Yue Wu,
Yongxiang Chen,
Chongyu Lu,
Yunjiang Rao
Abstract:
A highly coherent wave is favorable for applications in which phase retrieval is necessary, yet such a wave is prone to Rayleigh fading as it passes through a medium of random scatters. As an exemplary case, phase-sensitive optical time-domain reflectometry (Φ-OTDR) utilizes coherent interference of backscattered light along a fiber to achieve ultra-sensitive acoustic sensing, but sensing locations affected by fading are not functional. Apart from the sensing domain, fading is also ubiquitous in optical imaging and wireless telecommunication and is therefore of great interest. In this paper, we theoretically describe and experimentally verify how the fading phenomena in one-dimension optical scatters are suppressed with an arbitrary number of independent probing channels. We first explain theoretically why fading causes severe noise in the demodulated phase of Φ-OTDR; then, M-degree summation of incoherent scattered light-waves is studied for the purpose of eliminating fading. Finally, the gain of the retrieved phase signal-to-noise ratio and its fluctuations are analytically derived and experimentally verified. This work provides a guideline for fading elimination in one-dimension optical scatters, and it also provides insight for optical imaging and wireless telecommunication.
Submitted 8 December, 2018;
originally announced December 2018.
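The Rayleigh fading abstract above argues that fading is suppressed when several independent probing channels are combined. The toy Monte Carlo sketch below illustrates that general effect under a textbook model, not the paper's derivation: each channel's backscattered field is drawn as a circular complex Gaussian (so its intensity is exponentially distributed), and the intensities of M independent channels are averaged. The function name `fade_probability`, the fading threshold of 10% of the mean intensity, the trial count, and the seed are all assumptions chosen for illustration.

```python
import random


def fade_probability(m_channels, trials=5000, threshold=0.1, seed=0):
    """Estimate how often the combined intensity drops below `threshold`
    (expressed as a fraction of its expected value).

    Each channel is modelled as a circular complex Gaussian field, the usual
    idealisation of a sum of many randomly phased scatterers.
    """
    rng = random.Random(seed)
    fades = 0
    for _ in range(trials):
        total = 0.0
        for _ in range(m_channels):
            re, im = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            total += re * re + im * im  # single-channel intensity, mean 2
        # Normalise so the expected combined intensity equals 1.
        combined = total / (2.0 * m_channels)
        if combined < threshold:
            fades += 1
    return fades / trials


# Compare a single probing channel with four independent channels.
p_single = fade_probability(1)
p_four = fade_probability(4)
```

Under these assumptions a deep fade requires all M channels to be weak at once, so the estimated fading probability for four channels comes out well below the single-channel value.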
-
Pansharpening via Detail Injection Based Convolutional Neural Networks
Authors:
Lin He,
Yizhou Rao,
Jun Li,
Antonio Plaza,
Jiawei Zhu
Abstract:
Pansharpening aims to fuse a multispectral (MS) image with an associated panchromatic (PAN) image, producing a composite image with the spectral resolution of the former and the spatial resolution of the latter. Traditional pansharpening methods can be ascribed to a unified detail injection context, which views the injected MS details as the integration of PAN details and band-wise injection gains. In this work, we design a detail injection based CNN (DiCNN) framework for pansharpening, with the MS details being directly formulated in an end-to-end manner, where the first detail injection based CNN (DiCNN1) mines MS details through the PAN image and the MS image, and the second one (DiCNN2) utilizes only the PAN image. The main advantage of the proposed DiCNNs is that they provide explicit physical interpretations and achieve fast convergence together with high pansharpening quality. Furthermore, the effectiveness of the proposed approaches is also analyzed from a relatively theoretical point of view. Our methods are evaluated via experiments on real-world MS image datasets, achieving excellent performance when compared to other state-of-the-art methods.
Submitted 22 June, 2018;
originally announced June 2018.
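The pansharpening abstract above describes the traditional detail injection view, in which the details added to each MS band are PAN details scaled by a band-wise injection gain. A minimal sketch of that traditional formulation (not of the DiCNN networks themselves) is shown below. The function name `inject_details`, the flat-list pixel layout, and the assumption that an upsampled MS image and a low-pass version of the PAN image are already available are all hypothetical choices for this example.

```python
def inject_details(ms_up, pan, pan_low, gains):
    """Classical detail-injection fusion on toy data.

    ms_up:   list of MS bands, each a flat list of pixels already upsampled
             to the PAN grid (assumed to be given).
    pan:     flat list of PAN pixels.
    pan_low: low-pass version of `pan` at MS resolution (assumed to be given).
    gains:   one injection gain per MS band.
    """
    # PAN details: the high-frequency content missing from the MS image.
    details = [p - pl for p, pl in zip(pan, pan_low)]
    # Each band receives the same details, scaled by its own gain.
    return [
        [v + g * d for v, d in zip(band, details)]
        for band, g in zip(ms_up, gains)
    ]


# Toy usage: two bands, two pixels.
fused = inject_details(
    ms_up=[[1.0, 2.0], [3.0, 4.0]],
    pan=[5.0, 7.0],
    pan_low=[4.0, 8.0],
    gains=[1.0, 0.5],
)
```

In this picture, a learned approach such as the one the abstract proposes would replace the hand-designed `details` and `gains` terms with a network output, while keeping the additive structure that gives the fused image its physical interpretation.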