Showing 1–50 of 285 results for author: Wang, M

Searching in archive eess.
  1. arXiv:2506.14771  [pdf, other]

    eess.IV cs.CV cs.ET cs.MM

    Empirical Studies of Large Scale Environment Scanning by Consumer Electronics

    Authors: Mengyuan Wang, Yang Liu, Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik

    Abstract: This paper presents an empirical evaluation of the Matterport Pro3, a consumer-grade 3D scanning device, for large-scale environment reconstruction. We conduct detailed scanning (1,099 scanning points) of a six-floor building (17,567 square meters) and assess the device's effectiveness, limitations, and performance enhancements in diverse scenarios. Challenges encountered during the scanning are a…

    Submitted 27 March, 2025; originally announced June 2025.

    Comments: Accepted by IEEE Consumer Electronics Magazine

  2. arXiv:2506.13094  [pdf, ps, other]

    eess.IV

    MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

    Authors: Dingwei Fan, Junyong Zhao, Chunlin Li, Xinlong Wang, Ronghan Zhang, Mingliang Wang, Qi Zhu, Haipeng Si, Daoqiang Zhang, Liang Sun

    Abstract: Spine image segmentation is crucial for clinical diagnosis and treatment of spine diseases. The complex structure of the spine and the high morphological similarity between individual vertebrae and adjacent intervertebral discs make accurate spine segmentation a challenging task. Although the Segment Anything Model (SAM) has been developed, it still struggles to effectively capture and utilize mor…

    Submitted 16 June, 2025; originally announced June 2025.

  3. arXiv:2506.07709  [pdf, ps, other]

    eess.IV cs.CV

    Fine-Grained Motion Compression and Selective Temporal Fusion for Neural B-Frame Video Coding

    Authors: Xihua Sheng, Peilin Chen, Meng Wang, Li Zhang, Shiqi Wang, Dapeng Oliver Wu

    Abstract: With the remarkable progress in neural P-frame video coding, neural B-frame coding has recently emerged as a critical research direction. However, most existing neural B-frame codecs directly adopt P-frame coding tools without adequately addressing the unique challenges of B-frame compression, leading to suboptimal performance. To bridge this gap, we propose novel enhancements for motion compressi…

    Submitted 9 June, 2025; originally announced June 2025.

  4. arXiv:2506.03722  [pdf, other]

    cs.CL cs.SD eess.AS

    MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

    Authors: Yinfeng Xia, Huiyan Li, Chenyang Le, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian

    Abstract: Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-m…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted by Interspeech 2025

  5. arXiv:2506.00811  [pdf, ps, other]

    eess.SY eess.SP

    Conceal Truth while Show Fake: T/F Frequency Multiplexing based Anti-Intercepting Transmission

    Authors: Zhisheng Yin, Nan Cheng, Mingjie Wang, Changle Li, Wei Xiang

    Abstract: In wireless communication adversarial scenarios, signals are easily intercepted by non-cooperative parties, exposing the transmission of confidential information. This paper proposes a true-and-false (T/F) frequency multiplexing based anti-intercepting transmission scheme capable of concealing truth while showing fake (CTSF), integrating both offensive and defensive strategies. Specifically, throu…

    Submitted 31 May, 2025; originally announced June 2025.

  6. arXiv:2505.23782  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    4,500 Seconds: Small Data Training Approaches for Deep UAV Audio Classification

    Authors: Andrew P. Berg, Qian Zhang, Mia Y. Wang

    Abstract: Unmanned aerial vehicle (UAV) usage is expected to surge in the coming decade, raising the need for heightened security measures to prevent airspace violations and security threats. This study investigates deep learning approaches to UAV classification focusing on the key issue of data scarcity. To investigate this we opted to train the models using a total of 4,500 seconds of audio samples, evenl…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted at the 14th International Conference on Data Science, Technology, and Applications (DATA), 2025

  7. arXiv:2505.19486  [pdf, ps, other]

    eess.SY cs.LG cs.MA

    VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

    Authors: Maonan Wang, Yirong Chen, Aoyu Pang, Yuxin Cai, Chung Shue Chen, Yuheng Kan, Man-On Pun

    Abstract: Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-br…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: 25 pages, 15 figures

  8. arXiv:2505.14753  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    TransMedSeg: A Transferable Semantic Framework for Semi-Supervised Medical Image Segmentation

    Authors: Mengzhu Wang, Jiao Li, Shanshan Wang, Long Lan, Huibin Tan, Liang Yang, Guoli Yang

    Abstract: Semi-supervised learning (SSL) has achieved significant progress in medical image segmentation (SSMIS) through effective utilization of limited labeled data. While current SSL methods for medical images predominantly rely on consistency regularization and pseudo-labeling, they often overlook transferable semantic relationships across different clinical domains and imaging modalities. To address th…

    Submitted 20 May, 2025; originally announced May 2025.

  9. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  10. arXiv:2503.17831  [pdf, other]

    eess.IV cs.AI cs.CV

    FundusGAN: A Hierarchical Feature-Aware Generative Framework for High-Fidelity Fundus Image Generation

    Authors: Qingshan Hou, Meng Wang, Peng Cao, Zou Ke, Xiaoli Liu, Huazhu Fu, Osmar R. Zaiane

    Abstract: Recent advancements in ophthalmology foundation models such as RetFound have demonstrated remarkable diagnostic capabilities but require massive datasets for effective pre-training, creating significant barriers for development and deployment. To address this critical challenge, we propose FundusGAN, a novel hierarchical feature-aware generative framework specifically designed for high-fidelity fu…

    Submitted 22 March, 2025; originally announced March 2025.

  11. arXiv:2503.10045  [pdf, other]

    eess.IV cs.CV

    CPLOYO: A Pulmonary Nodule Detection Model with Multi-Scale Feature Fusion and Nonlinear Feature Learning

    Authors: Meng Wang, Zi Yang, Ruifeng Zhao, Yaoting Jiang

    Abstract: The integration of Internet of Things (IoT) technology in pulmonary nodule detection significantly enhances the intelligence and real-time capabilities of the detection system. Currently, lung nodule detection primarily focuses on the identification of solid nodules, but different types of lung nodules correspond to various forms of lung cancer. Multi-type detection contributes to improving the ov…

    Submitted 13 March, 2025; originally announced March 2025.

  12. arXiv:2503.08015  [pdf, other]

    cs.LG eess.SP

    GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals

    Authors: Zhaoliang Chen, Cheng Ding, Saurabh Kataria, Runze Yan, Minxiao Wang, Randall Lee, Xiao Hu

    Abstract: This study introduces a novel application of a Generative Pre-trained Transformer (GPT) model tailored for photoplethysmography (PPG) signals, serving as a foundation model for various downstream tasks. Adapting the standard GPT architecture to suit the continuous characteristics of PPG signals, our approach demonstrates promising results. Our models are pre-trained on our extensive dataset that c…

    Submitted 10 March, 2025; originally announced March 2025.

  13. arXiv:2503.01938  [pdf, other]

    eess.IV cs.CV

    A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal

    Authors: Jun-Jie Huang, Tianrui Liu, Zihan Chen, Xinwang Liu, Meng Wang, Pier Luigi Dragotti

    Abstract: Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically desi…

    Submitted 3 March, 2025; originally announced March 2025.

  14. arXiv:2502.19452  [pdf, other]

    eess.IV cs.CV

    SPU-IMR: Self-supervised Arbitrary-scale Point Cloud Upsampling via Iterative Mask-recovery Network

    Authors: Ziming Nie, Qiao Wu, Chenlei Lv, Siwen Quan, Zhaoshuai Qi, Muze Wang, Jiaqi Yang

    Abstract: Point cloud upsampling aims to generate dense and uniformly distributed point sets from sparse point clouds. Existing point cloud upsampling methods typically approach the task as an interpolation problem. They achieve upsampling by performing local interpolation between point clouds or in the feature space, then regressing the interpolated points to appropriate positions. By contrast, our propose…

    Submitted 26 February, 2025; originally announced February 2025.

  15. arXiv:2502.17239  [pdf, other]

    cs.CL cs.SD eess.AS

    Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

    Authors: Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen

    Abstract: We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame…

    Submitted 24 February, 2025; originally announced February 2025.

  16. arXiv:2502.14534  [pdf]

    eess.SP

    Poststroke rehabilitative mechanisms in individualized fatigue level-controlled treadmill training -- a Rat Model Study

    Authors: Yuchen Xu, Yulong Peng, Yuanfa Yao, Xiaoman Fan, Minmin Wang, Feng Gao, Mohamad Sawan, Shaomin Zhang, Xiaoling Hu

    Abstract: Individualized training improved post-stroke motor function rehabilitation efficiency. However, the mechanisms of how individualized training facilitates recovery are not clear. This study explored the cortical and corticomuscular rehabilitative effects in post-stroke motor function recovery during individualized training. Sprague-Dawley rats with intracerebral hemorrhage (ICH) were randomly distri…

    Submitted 20 February, 2025; originally announced February 2025.

  17. arXiv:2502.12327  [pdf, other]

    physics.plasm-ph cs.AI cs.LG eess.SY

    Learning Plasma Dynamics and Robust Rampdown Trajectories with Predict-First Experiments at TCV

    Authors: Allen M. Wang, Alessandro Pau, Cristina Rea, Oswin So, Charles Dawson, Olivier Sauter, Mark D. Boyer, Anna Vu, Cristian Galperti, Chuchu Fan, Antoine Merle, Yoeri Poels, Cristina Venturini, Stefano Marchioni, the TCV Team

    Abstract: The rampdown in tokamak operations is a difficult to simulate phase during which the plasma is often pushed towards multiple instability limits. To address this challenge, and reduce the risk of disrupting operations, we leverage recent advances in Scientific Machine Learning (SciML) to develop a neural state-space model (NSSM) that predicts plasma dynamics during Tokamak à Configuration Variable…

    Submitted 17 February, 2025; originally announced February 2025.

  18. arXiv:2502.06289  [pdf]

    eess.IV cs.AI cs.CV

    Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?

    Authors: Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham

    Abstract: The advent of foundation models (FMs) is transforming the medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domai…

    Submitted 10 February, 2025; originally announced February 2025.

  19. A Null Space Compliance Approach for Maintaining Safety and Tracking Performance in Human-Robot Interactions

    Authors: Zi-Qi Yang, Miaomiao Wang, Mehrdad R. Kermani

    Abstract: In recent years, the focus on developing robot manipulators has shifted towards prioritizing safety in Human-Robot Interaction (HRI). Impedance control is a typical approach for interaction control in collaboration tasks. However, such a control approach has two main limitations: 1) the end-effector (EE)'s limited compliance to adapt to unknown physical interactions, and 2) inability of the robot…

    Submitted 4 February, 2025; originally announced February 2025.

    Comments: 8 pages, 11 figures

  20. arXiv:2501.15368  [pdf, other]

    cs.CL cs.SD eess.AS

    Baichuan-Omni-1.5 Technical Report

    Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, et al. (68 additional authors not shown)

    Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip…

    Submitted 25 January, 2025; originally announced January 2025.

  21. Evaluation of Rail Decarbonization Alternatives: Framework and Application

    Authors: Adrian Hernandez, Max TM Ng, Nazib Siddique, Pablo L. Durango-Cohen, Amgad Elgowainy, Hani S. Mahmassani, Michael Wang, Yan Zhou

    Abstract: The Northwestern University Freight Rail Infrastructure and Energy Network Decarbonization (NUFRIEND) framework is a comprehensive industry-oriented tool for simulating the deployment of new energy technologies including biofuels, e-fuels, battery-electric, and hydrogen locomotives. By classifying fuel types into two categories based on deployment requirements, the associated optimal charging/fuel…

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: 29 pages, 17 figures. This is the accepted version of a work that was published in Transportation Research Record

    Journal ref: Transportation Research Record 2678.1 (2024): 102-121

  22. Technical Report: Towards Spatial Feature Regularization in Deep-Learning-Based Array-SAR Reconstruction

    Authors: Yu Ren, Xu Zhan, Yunqiao Hu, Xiangdong Ma, Liang Liu, Mou Wang, Jun Shi, Shunjun Wei, Tianjiao Zeng, Xiaoling Zhang

    Abstract: Array synthetic aperture radar (Array-SAR), also known as tomographic SAR (TomoSAR), has demonstrated significant potential for high-quality 3D mapping, particularly in urban areas. While deep learning (DL) methods have recently shown strengths in reconstruction, most studies rely on pixel-by-pixel reconstruction, neglecting spatial features like building structures, leading to artifacts such as ho…

    Submitted 21 December, 2024; originally announced December 2024.

  23. arXiv:2412.14547  [pdf, other]

    cs.CV eess.IV

    Bright-NeRF: Brightening Neural Radiance Field with Color Restoration from Low-light Raw Images

    Authors: Min Wang, Xin Huang, Guoqing Zhou, Qifeng Guo, Qing Wang

    Abstract: Neural Radiance Fields (NeRFs) have demonstrated prominent performance in novel view synthesis. However, their input heavily relies on image acquisition under normal light conditions, making it challenging to learn accurate scene representation in low-light environments where images typically exhibit significant noise and severe color distortion. To address these challenges, we propose a novel app…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI2025

  24. arXiv:2412.06401  [pdf, ps, other]

    eess.SY

    Memory-Based Control with Event-Triggered Protocol for interval type-2 fuzzy network system under fading channel

    Authors: Sen Kong, Meng Wang

    Abstract: To address the challenges in networked environments and control problems associated with complex nonlinear uncertain systems, this paper investigates the design of a membership-function-dependent (MFD) memory output-feedback (MOF) controller for interval type-2 (IT2) fuzzy systems under fading channels, leveraging a memory dynamic event-triggering mechanism (MDETM). To conserve communication resou…

    Submitted 9 December, 2024; originally announced December 2024.

  25. arXiv:2411.18953  [pdf, other]

    eess.AS

    AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models

    Authors: Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, Mark D. Plumbley, Woon-Seng Gan, Jianfeng Chen

    Abstract: With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporat…

    Submitted 28 November, 2024; originally announced November 2024.

  26. arXiv:2411.14135  [pdf, other]

    eess.IV cs.MM

    Compact Visual Data Representation for Green Multimedia -- A Human Visual System Perspective

    Authors: Peilin Chen, Xiaohan Fang, Meng Wang, Shiqi Wang, Siwei Ma

    Abstract: The Human Visual System (HVS), with its intricate sophistication, is capable of achieving ultra-compact information compression for visual signals. This remarkable ability is coupled with high generalization capability and energy efficiency. By contrast, the state-of-the-art Versatile Video Coding (VVC) standard achieves a compression ratio of around 1,000 times for raw visual data. This notable d…

    Submitted 26 December, 2024; v1 submitted 21 November, 2024; originally announced November 2024.

  27. arXiv:2411.06217  [pdf, other]

    eess.AS

    Selective State Space Model for Monaural Speech Enhancement

    Authors: Moran Chen, Qiquan Zhang, Mingjiang Wang, Xiangyu Zhang, Hexin Liu, Eliathamby Ambikairaiah, Deying Chen

    Abstract: Voice user interfaces (VUIs) have facilitated the efficient interactions between humans and machines through spoken commands. Since real-world acoustic scenes are complex, speech enhancement plays a critical role for robust VUI. Transformer and its variants, such as Conformer, have demonstrated cutting-edge results in speech enhancement. However, both of them suffer from the quadratic computationa…

    Submitted 9 November, 2024; originally announced November 2024.

    Comments: Submitted to IEEE TCE

  28. arXiv:2411.06193  [pdf, ps, other]

    cs.IT eess.SP

    Large Language Models and Artificial Intelligence Generated Content Technologies Meet Communication Networks

    Authors: Jie Guo, Meiting Wang, Hang Yin, Bin Song, Yuhao Chi, Fei Richard Yu, Chau Yuen

    Abstract: Artificial intelligence generated content (AIGC) technologies, with a predominance of large language models (LLMs), have demonstrated remarkable performance improvements in various applications, which have attracted great interest from both academia and industry. Although some noteworthy advancements have been made in this area, a comprehensive exploration of the intricate relationship between AI…

    Submitted 12 November, 2024; v1 submitted 9 November, 2024; originally announced November 2024.

    Comments: Accepted by IEEE Internet of Things Journal

  29. arXiv:2411.05205  [pdf, other]

    eess.SY cs.AI cs.NI

    Maximizing User Connectivity in AI-Enabled Multi-UAV Networks: A Distributed Strategy Generalized to Arbitrary User Distributions

    Authors: Bowei Li, Yang Xu, Ran Zhang, Jiang Xie, Miao Wang

    Abstract: Deep reinforcement learning (DRL) has been extensively applied to Multi-Unmanned Aerial Vehicle (UAV) network (MUN) to effectively enable real-time adaptation to complex, time-varying environments. Nevertheless, most of the existing works assume a stationary user distribution (UD) or a dynamic one with predicted patterns. Such considerations may make the UD-specific strategies insufficient when a…

    Submitted 7 November, 2024; originally announced November 2024.

  30. arXiv:2411.00911  [pdf, other]

    eess.IV cs.CV cs.LG physics.geo-ph

    Zero-Shot Self-Consistency Learning for Seismic Irregular Spatial Sampling Reconstruction

    Authors: Junheng Peng, Yingtian Liu, Mingwei Wang, Yong Li, Huating Li

    Abstract: Seismic exploration is currently the most important method for understanding subsurface structures. However, due to surface conditions, seismic receivers may not be uniformly distributed along the measurement line, making the entire exploration work difficult to carry out. Previous deep learning methods for reconstructing seismic data often relied on additional datasets for training. While some ex…

    Submitted 1 November, 2024; originally announced November 2024.

    Comments: 12 pages, 8 figures

    MSC Class: 68T07 ACM Class: I.4.5

  31. arXiv:2410.21276  [pdf, other]

    cs.CL cs.AI cs.CV cs.CY cs.LG cs.SD eess.AS

    GPT-4o System Card

    Authors: OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, et al. (395 additional authors not shown)

    Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 mil…

    Submitted 25 October, 2024; originally announced October 2024.

  32. arXiv:2409.19963  [pdf, other]

    eess.IV cs.CV cs.LG

    A Self-attention Residual Convolutional Neural Network for Health Condition Classification of Cow Teat Images

    Authors: Minghao Wang

    Abstract: Milk is a highly important consumer for Americans and the health of the cows' teats directly affects the quality of the milk. Traditionally, veterinarians manually assessed teat health by visually inspecting teat-end hyperkeratosis during the milking process which is limited in time, usually only tens of seconds, and weakens the accuracy of the health assessment of cows' teats. Convolutional neura…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2409.18797

  33. arXiv:2409.18797  [pdf, ps, other]

    cs.CV cs.AI cs.LG eess.IV

    Supervised Learning Model for Key Frame Identification from Cow Teat Videos

    Authors: Minghao Wang, Pinxue Lin

    Abstract: This paper proposes a method for improving the accuracy of mastitis risk assessment in cows using neural networks and video analysis. Mastitis, an infection of the udder tissue, is a critical health problem for cows and can be detected by examining the cow's teat. Traditionally, veterinarians assess the health of a cow's teat during the milking process, but this process is limited in time and can…

    Submitted 26 September, 2024; originally announced September 2024.

  34. arXiv:2409.17603  [pdf, other]

    cs.CL cs.SD eess.AS

    Deep CLAS: Deep Contextual Listen, Attend and Spell

    Authors: Mengzhi Wang, Shifu Xiong, Genshun Wan, Hang Chen, Jianqing Gao, Lirong Dai

    Abstract: Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to use contextual information better. We introduce bias loss forcing model…

    Submitted 19 December, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: Submitted to JUSTC

  35. arXiv:2409.17139  [pdf, other]

    eess.SY cs.LG cs.NI

    Learning with Dynamics: Autonomous Regulation of UAV Based Communication Networks with Dynamic UAV Crew

    Authors: Ran Zhang, Bowei Li, Liyuan Zhang, Jiang Xie, Miao Wang

    Abstract: Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) are a key component in future mobile networking. To handle the dynamic environments in UCNs, reinforcement learning (RL) has been a promising solution attributed to its strong capability of adaptive decision-making free of the environment models. However, most existing RL-based research focuses on control strategy design assuming a fi…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 7 pages, 6 figures, magazine paper

  36. arXiv:2409.14330  [pdf, other]

    eess.IV cs.CV

    Thinking in Granularity: Dynamic Quantization for Image Super-Resolution by Intriguing Multi-Granularity Clues

    Authors: Mingshen Wang, Zhao Zhang, Feng Li, Ke Xu, Kang Miao, Meng Wang

    Abstract: Dynamic quantization has attracted rising attention in image super-resolution (SR) as it expands the potential of heavy SR models onto mobile devices while preserving competitive performance. Existing methods explore layer-to-bit configuration upon varying local regions, adaptively allocating the bit to each layer and patch. Despite the benefits, they still fall short in the trade-off of SR accura…

    Submitted 22 December, 2024; v1 submitted 22 September, 2024; originally announced September 2024.

    Comments: AAAI 2025

  37. Lightweight Transducer Based on Frame-Level Criterion

    Authors: Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye

    Abstract: The transducer model trained based on sequence-level criterion requires a lot of memory due to the generation of the large probability matrix. We proposed a lightweight transducer model based on frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. Then the encoder output can be combined with the decoder output at the correspondi…

    Submitted 1 November, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: Accepted by Interspeech 2024, code repository: https://github.com/wangmengzhi/Lightweight-Transducer

    Journal ref: Proc. Interspeech 2024, 247-251 (2024)

  38. arXiv:2409.13523  [pdf, other]

    cs.CL cs.SD eess.AS

    EMMeTT: Efficient Multimodal Machine Translation Training

    Authors: Piotr Żelasko, Zhehuai Chen, Mengru Wang, Daniel Galvez, Oleksii Hrinchuk, Shuoyang Ding, Ke Hu, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only G…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: 4 pages, submitted to ICASSP 2025

  39. arXiv:2409.12375  [pdf]

    eess.SP

    XRL: An FMM-Accelerated SIE Simulator for Resistance and Inductance Extraction of Complicated 3-D Geometries

    Authors: Mingyu Wang, Ping Liu, Jihong Gu, Xiaofan Jia, Abdulkadir C. Yucel

    Abstract: A fast multipole method (FMM)-accelerated surface integral equation (SIE) simulator, called XRL, is proposed for broadband resistance/inductance (RL) extraction under the magneto-quasi-static assumption. The proposed XRL has three key attributes that make it highly efficient and accurate for broadband RL extraction of complicated 3-D geometries: (i) The XRL leverages a novel centroid-midpoint basi…

    Submitted 18 September, 2024; originally announced September 2024.

  40. arXiv:2409.10310  [pdf, other]

    cs.RO eess.SY

    Safe and Real-Time Consistent Planning for Autonomous Vehicles in Partially Observed Environments via Parallel Consensus Optimization

    Authors: Lei Zheng, Rui Yang, Minzhe Zheng, Michael Yu Wang, Jun Ma

    Abstract: Ensuring safety and driving consistency is a significant challenge for autonomous vehicles operating in partially observed environments. This work introduces a consistent parallel trajectory optimization (CPTO) approach to enable safe and consistent driving in dense obstacle environments with perception uncertainties. Utilizing discrete-time barrier function theory, we develop a consensus safety b…

    Submitted 16 September, 2024; originally announced September 2024.

  41. arXiv:2409.06010  [pdf, other]

    cs.NI eess.SY

    When Learning Meets Dynamics: Distributed User Connectivity Maximization in UAV-Based Communication Networks

    Authors: Bowei Li, Saugat Tripathi, Salman Hosain, Ran Zhang, Jiang Xie, Miao Wang

    Abstract: Distributed management over Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) has attracted increasing research attention. In this work, we study a distributed user connectivity maximization problem in a UCN. The work features a horizontal study over different levels of information exchange during the distributed iteration and a consideration of dynamics in UAV set and user distrib…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: 12 pages, 12 figures, journal draft

  42. arXiv:2409.04113  [pdf, ps, other]

    eess.SP

    A New Channel Model for OAM Wireless Communication at 5.8 and 28 GHz

    Authors: Runyu Lyu, Wenchi Cheng, Muyao Wang, Fan Qin, Tony Q. S. Quek

    Abstract: Orbital angular momentum (OAM) in electromagnetic (EM) waves can significantly enhance spectrum efficiency in wireless communications without requiring additional power, time, or frequency resources. Different OAM modes in EM waves create orthogonal channels, thereby improving spectrum efficiency. Additionally, OAM waves can more easily maintain orthogonality in line-of-sight (LOS) transmissions,…

    Submitted 6 September, 2024; originally announced September 2024.

    Comments: 13 pages, 13 figures, submitted to IEEE Transactions on Wireless Communications (TWC)

  43. arXiv:2409.02041  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

    Authors: Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

    Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several a…

    Submitted 24 October, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

  44. arXiv:2408.16886  [pdf, other]

    eess.IV cs.CV

    LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation

    Authors: Juntao Jiang, Mengmeng Wang, Huizhong Tian, Lingbo Cheng, Yong Liu

    Abstract: While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deplo…

    Submitted 2 December, 2024; v1 submitted 29 August, 2024; originally announced August 2024.

    Comments: Accepted by IEEE BIBM2024 ML4BMI workshop

  45. arXiv:2408.10934  [pdf, other]

    cs.CV cs.AI eess.IV

    SDI-Net: Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement

    Authors: Linlin Hu, Ao Sun, Shijie Hao, Richang Hong, Meng Wang

    Abstract: Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-v…

    Submitted 20 August, 2024; originally announced August 2024.

  46. arXiv:2407.21600  [pdf, other]

    eess.IV cs.AI cs.CV eess.SP physics.med-ph

    Robust Simultaneous Multislice MRI Reconstruction Using Deep Generative Priors

    Authors: Shoujin Huang, Guanxiong Luo, Yunlin Zhao, Yilong Liu, Yuwan Wang, Kexin Yang, Jingzhe Liu, Hua Guo, Min Wang, Lingyan Zhang, Mengye Lyu

    Abstract: Simultaneous multislice (SMS) imaging is a powerful technique for accelerating magnetic resonance imaging (MRI) acquisitions. However, SMS reconstruction remains challenging due to complex signal interactions between and within the excited slices. In this study, we introduce ROGER, a robust SMS MRI reconstruction method based on deep generative priors. Utilizing denoising diffusion probabilistic m…

    Submitted 23 January, 2025; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Submitted to Medical Image Analysis. New fMRI analysis and figures are added since v1

  47. arXiv:2407.18099  [pdf, other]

    eess.SY cs.RO

    Pose, Velocity and Landmark Position Estimation Using IMU and Bearing Measurements

    Authors: Miaomiao Wang, Abdelhamid Tayebi

    Abstract: This paper investigates the estimation problem of the pose (orientation and position) and linear velocity of a rigid body, as well as the landmark positions, using an inertial measurement unit (IMU) and a monocular camera. First, we propose a globally exponentially stable (GES) linear time-varying (LTV) observer for the estimation of body-frame landmark positions and velocity, using IMU and monocu…

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: 8 pages, 3 figures

  48. arXiv:2407.13255  [pdf, other]

    cs.IT eess.SP

    Interleaved Block-Sparse Transform

    Authors: Lei Liu, Ming Wang, Shufeng Li, Yuhao Chi, Ning Wei, ZhaoYang Zhang

    Abstract: Low-complexity Bayes-optimal memory approximate message passing (MAMP) is an efficient signal estimation algorithm in compressed sensing and multicarrier modulation. However, achieving replica Bayes optimality with MAMP necessitates a large-scale right-unitarily invariant transformation, which is prohibitive in practical systems due to its high computational complexity and hardware costs. To solve…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Submitted to the IEEE Journal

  49. arXiv:2407.13229  [pdf, other]

    cs.RO eess.SY

    Learning-based Observer for Coupled Disturbance

    Authors: Jindou Jia, Meng Wang, Zihan Yang, Bin Yang, Yuhang Liu, Kexin Guo, Xiang Yu

    Abstract: Achieving high-precision control for robotic systems is hindered by the low-fidelity dynamical model and external disturbances. Especially, the intricate coupling between internal uncertainties and external disturbances further exacerbates this challenge. This study introduces an effective and convergent algorithm enabling accurate estimation of the coupled disturbance via combining control and le…

    Submitted 14 April, 2025; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: 17 pages, 9 figures

  50. arXiv:2407.11322  [pdf, ps, other]

    eess.SP

    Reconfigurable-Intelligent-Surface Assisted Orbital-Angular-Momentum Secure Communications

    Authors: Minmin Wang, Liping Liang, Wenchi Cheng, Wei Zhang, Ruirui Chen, Hailin Zhang

    Abstract: As a kind of wavefront with helical phase, orbital angular momentum (OAM) shows the great potential to enhance the security results of wireless communications due to its unique orthogonality and central hollow electromagnetic wave structure. Therefore, in this paper we propose the reconfigurable-intelligent-surface (RIS) assisted OAM scheme, where RIS is deployed to weaken the information acquisit…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2406.05799