Showing 1–26 of 26 results for author: Dai, D

Searching in archive eess.
  1. arXiv:2507.13993   

    eess.IV cs.AI cs.CV

    OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models

    Authors: Ningyong Wu, Jinzhi Wang, Wenhong Zhao, Chenzhan Yu, Zhigang Xiu, Duwei Dai

    Abstract: The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fractur…

    Submitted 26 July, 2025; v1 submitted 18 July, 2025; originally announced July 2025.

    Comments: This paper contains significant issues in the data preprocessing stage, which led to non-reproducible results. We are currently correcting the errors and will submit a revised version in the future.

  2. arXiv:2507.12938  [pdf, ps, other]

    eess.IV cs.CV

    Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion

    Authors: Caixia Dong, Duwei Dai, Xinyi Han, Fan Liu, Xu Yang, Zongfang Li, Songhua Xu

    Abstract: Accurate coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Sp…

    Submitted 17 July, 2025; originally announced July 2025.

    Journal ref: MICCAI2025

  3. arXiv:2505.12089  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results

    Authors: Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyung-Ju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, Qi Wu, Tianheng Qiu, Yuchun Dong, Shenglin Ding, Guanghua Pan, Weiyu Zhou, Tao Hu, Yixu Feng, Duwei Dai, Yu Cao, Peng Wu, Wei Dong, Yanning Zhang, Qingsen Yan, Simon J. Larsen, et al. (11 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effect…

    Submitted 17 May, 2025; originally announced May 2025.

  4. arXiv:2503.17261  [pdf, other]

    eess.IV cs.CV

    Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images

    Authors: Jie Mei, Chenyu Lin, Yu Qiu, Yaonan Wang, Hui Zhang, Ziyang Wang, Dong Dai

    Abstract: Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, while it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based models are expected to address these problems, however, existing small-scale and private datasets limit signifi…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  5. arXiv:2501.01103  [pdf, other]

    eess.AS cs.AI cs.SD

    Learning discriminative features from spectrograms using center loss for speech emotion recognition

    Authors: Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng

    Abstract: Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker. However, extracting effective features for emotion recognition is difficult, as emotions are ambiguous. We propose a novel approach to learn discriminative features from variable length spectrograms for emotion recognition by cooperating softmax cross-entropy loss and center loss t…

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: Accepted at ICASSP 2019

    Journal ref: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2019, pp. 7405-7409

  6. arXiv:2501.01102  [pdf, other]

    eess.AS cs.AI cs.SD

    Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT

    Authors: Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng

    Abstract: Grapheme-to-phoneme (G2P) conversion serves as an essential component in Chinese Mandarin text-to-speech (TTS) system, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts sentence containing polyphonic character as input in the form of Chinese character sequence without the necessi…

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: Accepted at INTERSPEECH 2019

    Journal ref: Proc. Interspeech 2019, pp. 2090-2094

  7. arXiv:2411.17420  [pdf, other]

    cs.CE eess.IV

    Cross-modal Medical Image Generation Based on Pyramid Convolutional Attention Network

    Authors: Fuyou Mao, Lixin Lin, Ming Jiang, Dong Dai, Chao Yang, Hao Zhang, Yan Tang

    Abstract: The integration of multimodal medical imaging can provide complementary and comprehensive information for the diagnosis of Alzheimer's disease (AD). However, in clinical practice, since positron emission tomography (PET) is often missing, multimodal images might be incomplete. To address this problem, we propose a method that can efficiently utilize structural magnetic resonance imaging (sMRI) ima…

    Submitted 28 November, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: 18 pages, 6 figures, Machine Vision and Applications

  8. arXiv:2408.15916  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-modal Adversarial Training for Zero-Shot Voice Cloning

    Authors: John Janiczek, Dading Chong, Dongyang Dai, Arlo Faria, Chao Wang, Tao Wang, Yuzong Liu

    Abstract: A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build off of recent works which have used…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Accepted at INTERSPEECH 2024

  9. arXiv:2407.16634  [pdf, other]

    eess.IV cs.AI cs.CV cs.HC

    Knowledge-driven AI-generated data for accurate and interpretable breast ultrasound diagnoses

    Authors: Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, Qingli Zhu, Yong Wang, Liwei Wang

    Abstract: Data-driven deep learning models have shown great capabilities to assist radiologists in breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address a long-standing challenge of improving the diagnostic model performance on rare cases using long-tailed data. Specifical…

    Submitted 23 July, 2024; originally announced July 2024.

  10. arXiv:2403.05010  [pdf, other]

    cs.SD cs.AI eess.AS

    RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

    Authors: Peng Liu, Dongyang Dai, Zhiyong Wu

    Abstract: Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow…

    Submitted 6 October, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  11. arXiv:2403.02894  [pdf]

    eess.SP

    DIFNet: SAR RFI suppression based on domain invariant features

    Authors: Fuping Fang, Wenhao Lv, Dahai Dai

    Abstract: Synthetic aperture radar is a high-resolution two-dimensional imaging radar, however, during the imaging process, SAR is susceptible to intentional and unintentional interference, with radio frequency interference (RFI) being the most common type, leading to a severe degradation in image quality. Although inpainting networks have achieved excellent results, their generalization is unclear, and whe…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: five pages

  12. arXiv:2301.06622  [pdf, other]

    cs.DC eess.SY

    IOPathTune: Adaptive Online Parameter Tuning for Parallel File System I/O Path

    Authors: Md. Hasanur Rashid, Youbiao He, Forrest Sheng Bao, Dong Dai

    Abstract: Parallel file systems contain complicated I/O paths from clients to storage servers. An efficient I/O path requires proper settings of multiple parameters, as the default settings often fail to deliver optimal performance, especially for diverse workloads in the HPC environment. Existing tuning strategies have shortcomings in being adaptive, timely, and flexible. We propose IOPathTune, which adapt…

    Submitted 16 January, 2023; originally announced January 2023.

  13. arXiv:2212.08558  [pdf, other]

    cs.RO cs.CV eess.SP

    Simulating Road Spray Effects in Automotive Lidar Sensor Models

    Authors: Clemens Linnhoff, Dominik Scheuble, Mario Bijelic, Lukas Elster, Philipp Rosenberger, Werner Ritter, Dengxin Dai, Hermann Winner

    Abstract: Modeling perception sensors is key for simulation based testing of automated driving functions. Beyond weather conditions themselves, sensors are also subjected to object dependent environmental influences like tire spray caused by vehicles moving on wet pavement. In this work, a novel modeling approach for spray in lidar data is introduced. The model conforms to the Open Simulation Interface (OSI…

    Submitted 16 December, 2022; originally announced December 2022.

    Comments: Submitted to IEEE Sensors Journal

  14. arXiv:2110.03347  [pdf, ps, other]

    eess.AS cs.HC cs.SD

    Cloning one's voice using very limited data in the wild

    Authors: Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang

    Abstract: With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and tim…

    Submitted 8 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

  15. arXiv:2109.02763  [pdf, other]

    cs.SD cs.CV eess.AS

    Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds

    Authors: Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, Luc Van Gool

    Abstract: Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, a…

    Submitted 27 February, 2022; v1 submitted 6 September, 2021; originally announced September 2021.

    Comments: Accepted by TPAMI. arXiv admin note: substantial text overlap with arXiv:2003.04210

  16. arXiv:2012.11174  [pdf, other]

    eess.AS cs.AI

    Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network

    Authors: Xiong Cai, Zhiyong Wu, Kuo Zhong, Bin Su, Dongyang Dai, Helen Meng

    Abstract: By using deep learning approaches, Speech Emotion Recognition (SER) on a single domain has achieved many excellent results. However, cross-domain SER is still a challenging task due to the distribution shift between source and target domains. In this work, we propose a Domain Adversarial Neural Network (DANN) based approach to mitigate this distribution shift problem for cross-lingual SER. Specifica…

    Submitted 21 December, 2020; originally announced December 2020.

    Comments: This paper has been accepted by ISCSLP2021

    ACM Class: I.2

  17. arXiv:2010.13350  [pdf, other]

    eess.AS cs.SD

    Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition

    Authors: Xiong Cai, Dongyang Dai, Zhiyong Wu, Xiang Li, Jingbei Li, Helen Meng

    Abstract: Neural text-to-speech (TTS) approaches generally require a huge number of high quality speech data, which makes it difficult to obtain such a dataset with extra emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emot…

    Submitted 17 January, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

    Comments: icassp2021 final version

    MSC Class: I.2

  18. arXiv:2006.11610  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

    Authors: Huirong Huang, Zhiyong Wu, Shiyin Kang, Dongyang Dai, Jia Jia, Tianxiao Fu, Deyi Tuo, Guangzhi Lei, Peng Liu, Dan Su, Dong Yu, Helen Meng

    Abstract: Generating 3D speech-driven talking head has received more and more attention in recent years. Recent approaches mainly have following limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phone…

    Submitted 20 June, 2020; originally announced June 2020.

    Comments: 5 pages, 5 figures

  19. arXiv:2006.04648  [pdf, other]

    cs.CV cs.LG eess.IV

    Graph-based Visual-Semantic Entanglement Network for Zero-shot Image Recognition

    Authors: Yang Hu, Guihua Wen, Adriane Chapman, Pei Yang, Mingnan Luo, Yingxue Xu, Dan Dai, Wendy Hall

    Abstract: Zero-shot learning uses semantic attributes to connect the search space of unseen objects. In recent years, although the deep convolutional network brings powerful visual modeling capabilities to the ZSL task, its visual features have severe pattern inertia and lack of representation of semantic relationships, which leads to severe bias and ambiguity. In response to this, we propose the Graph-base…

    Submitted 11 June, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 15 pages, 11 figures, on IEEE Transactions on Multimedia

    Journal ref: [J]. IEEE Transactions on Multimedia, 2021

  20. arXiv:2005.12531  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

    Authors: Dongyang Dai, Li Chen, Yuping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang

    Abstract: With the popularity of deep neural network, speech synthesis task has achieved significant improvements based on the end-to-end encoder-decoder framework in the recent days. More and more applications relying on speech synthesis technology have been widely used in our daily life. Robust speech synthesis model depends on high quality and customized data which needs lots of collecting efforts. It is…

    Submitted 22 October, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

  21. arXiv:2004.01643  [pdf, other]

    cs.CV cs.LG eess.IV

    Quantifying Data Augmentation for LiDAR based 3D Object Detection

    Authors: Martin Hahner, Dengxin Dai, Alexander Liniger, Luc Van Gool

    Abstract: In this work, we shed light on different data augmentation techniques commonly used in Light Detection and Ranging (LiDAR) based 3D Object Detection. For the bulk of our experiments, we utilize the well known PointPillars pipeline and the well established KITTI dataset. We investigate a variety of global and local augmentation techniques, where global augmentation techniques are applied to the ent…

    Submitted 29 July, 2022; v1 submitted 3 April, 2020; originally announced April 2020.

    Comments: 2022 Update

  22. arXiv:2003.04210  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

    Authors: Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

    Abstract: Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: Project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.html

  23. arXiv:2003.00636  [pdf, other]

    cs.CV cs.LG eess.IV

    Matching Neuromorphic Events and Color Images via Adversarial Learning

    Authors: Fang Xu, Shijie Lin, Wen Yang, Lei Yu, Dengxin Dai, Gui-song Xia

    Abstract: The event camera has appealing properties: high dynamic range, low latency, low power consumption and low memory usage, and thus provides complementariness to conventional frame-based cameras. It only captures the dynamics of a scene and is able to capture almost "continuous" motion. However, different from frame-based camera that reflects the whole appearance as scenes are, the event camera casts…

    Submitted 1 March, 2020; originally announced March 2020.

  24. arXiv:2001.02613  [pdf, other]

    cs.CV cs.LG cs.RO eess.IV

    Don't Forget The Past: Recurrent Depth Estimation from Monocular Video

    Authors: Vaishakh Patil, Wouter Van Gansbeke, Dengxin Dai, Luc Van Gool

    Abstract: Autonomous cars need continuously updated depth information. Thus far, depth is mostly estimated independently for a single frame at a time, even if the method starts from video input. Our method produces a time series of depth maps, which makes it an ideal candidate for online learning approaches. In particular, we put three different types of depth estimation (supervised depth prediction, self-s…

    Submitted 28 July, 2020; v1 submitted 8 January, 2020; originally announced January 2020.

    Comments: Please refer to our webpage for details https://www.trace.ethz.ch/publications/2020/rec_depth_estimation/

  25. arXiv:1907.05738  [pdf, other]

    cs.CV cs.RO eess.SY

    Learning a Curve Guardian for Motorcycles

    Authors: Simon Hecker, Alexander Liniger, Henrik Maurenbrecher, Dengxin Dai, Luc Van Gool

    Abstract: Up to 17% of all motorcycle accidents occur when the rider is maneuvering through a curve and the main cause of curve accidents can be attributed to inappropriate speed and wrong intra-lane position of the motorcycle. Existing curve warning systems lack crucial state estimation components and do not scale well. We propose a new type of road curvature warning system for motorcycles, combining the l…

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: 8 pages, to be presented at IEEE-ITSC 2019

  26. arXiv:1807.08312  [pdf, ps, other]

    eess.AS cs.AI cs.LG cs.SD

    Unified Hypersphere Embedding for Speaker Recognition

    Authors: Mahdi Hajibabaei, Dengxin Dai

    Abstract: Incremental improvements in accuracy of Convolutional Neural Networks are usually achieved through use of deeper and more complex models trained on larger datasets. However, enlarging dataset and models increases the computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition…

    Submitted 22 July, 2018; originally announced July 2018.