Search | arXiv e-print repository

arXiv:2509.20741 [pdf, ps, other]

Real-Time System for Audio-Visual Target Speech Enhancement

Authors: T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang

Abstract: We present a live demonstration for RAVEN, a real-time audio-visual speech enhancement system designed to run entirely on a CPU. In single-channel, audio-only settings, speech enhancement is traditionally approached as the task of extracting clean speech from environmental noise. More recent work has explored the use of visual cues, such as lip movements, to improve robustness, particularly in the presence of interfering speakers. However, to our knowledge, no prior work has demonstrated an interactive system for real-time audio-visual speech enhancement operating on CPU hardware. RAVEN fills this gap by using pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information. The system generalizes across environmental noise, interfering speakers, transient sounds, and even singing voices. In this demonstration, attendees will be able to experience live audio-visual target speech enhancement using a microphone and webcam setup, with clean speech playback through headphones.

Submitted 25 September, 2025; originally announced September 2025.

Comments: Accepted into WASPAA 2025 demo session

arXiv:2507.21448 [pdf, ps, other]

Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations

Authors: T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang

Abstract: Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.

Submitted 4 August, 2025; v1 submitted 28 July, 2025; originally announced July 2025.

Comments: Accepted into Interspeech 2025; corrected author name typo

arXiv:2507.19531 [pdf, ps, other]

A safety governor for learning explicit MPC controllers from data

Authors: Anjie Mao, Zheming Wang, Hao Gu, Bo Chen, Li Yu

Abstract: We tackle neural networks (NNs) to approximate model predictive control (MPC) laws. We propose a novel learning-based explicit MPC structure, which is reformulated into a dual-mode scheme over maximal constrained feasible set. The scheme ensuring the learning-based explicit MPC reduces to linear feedback control while entering the neighborhood of origin. We construct a safety governor to ensure that learning-based explicit MPC satisfies all the state and input constraints. Compare to the existing approach, our approach is computationally easier to implement even in high-dimensional system. The proof of recursive feasibility for the safety governor is given. Our approach is demonstrated on numerical examples.

Submitted 21 July, 2025; originally announced July 2025.

arXiv:2501.11570 [pdf, other]

Uncertainty Estimation in the Real World: A Study on Music Emotion Recognition

Authors: Karn N. Watcharasupat, Yiwei Ding, T. Aleksandra Ma, Pavan Seshadri, Alexander Lerch

Abstract: Any data annotation for subjective tasks shows potential variations between individuals. This is particularly true for annotations of emotional responses to musical stimuli. While older approaches to music emotion recognition systems frequently addressed this uncertainty problem through probabilistic modeling, modern systems based on neural networks tend to ignore the variability and focus only on predicting central tendencies of human subjective responses. In this work, we explore several methods for estimating not only the central tendencies of the subjective responses to a musical stimulus, but also for estimating the uncertainty associated with these responses. In particular, we investigate probabilistic loss functions and inference-time random sampling. Experimental results indicate that while the modeling of the central tendencies is achievable, modeling of the uncertainty in subjective responses proves significantly more challenging with currently available approaches even when empirical estimates of variations in the responses are available.

Submitted 20 January, 2025; originally announced January 2025.

Comments: To be presented as a Findings paper at the 2025 European Conference on Information Retrieval (ECIR)

arXiv:2412.19552 [pdf, ps, other]

Contrast-Optimized Basis Functions for Self-Navigated Motion Correction in Quantitative MRI

Authors: Elisa Marchetto, Sebastian Flassbeck, Andrew Mao, Jakob Assländer

Abstract: Purpose: The long scan times of quantitative MRI techniques make motion artifacts more likely. For MR-Fingerprinting-like approaches, this problem can be addressed with self-navigated retrospective motion correction based on reconstructions in a singular value decomposition (SVD) subspace. However, the SVD promotes high signal intensity in all tissues, which limits the contrast between tissue types and ultimately reduces the accuracy of registration. The purpose of this paper is to rotate the subspace for maximum contrast between two types of tissue and improve the accuracy of motion estimates. Methods: A subspace is derived that promotes contrasts between brain parenchyma and CSF, achieved through the generalized eigendecomposition of mean autocorrelation matrices, followed by a Gram-Schmidt process to maintain orthogonality. We tested our motion correction method on 85 scans with varying motion levels, acquired with a 3D hybrid-state sequence optimized for quantitative magnetization transfer imaging. Results: A comparative analysis shows that the contrast-optimized basis significantly improve the parenchyma-CSF contrast, leading to smoother motion estimates and reduced artifacts in the quantitative maps. Conclusion: The proposed contrast-optimized subspace improves the accuracy of the motion estimation.

Submitted 17 June, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

arXiv:2409.07730 [pdf, other]

Music auto-tagging in the long tail: A few-shot approach

Authors: T. Aleksandra Ma, Alexander Lerch

Abstract: In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.

Submitted 16 September, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

Comments: Published in Audio Engineering Society NY Show 2024 as a Peer Reviewed (Category 1) paper; typos corrected

ACM Class: H.3.3

arXiv:2405.07905 [pdf, other]

PLUTO: Pathology-Universal Transformer

Authors: Dinkar Juyal, Harshith Padigela, Chintan Shah, Daniel Shenker, Natalia Harguindeguy, Yi Liu, Blake Martin, Yibo Zhang, Michael Nercessian, Miles Markey, Isaac Finberg, Kelsey Luu, Daniel Borders, Syed Ashar Javed, Emma Krause, Raymond Biju, Aashish Sood, Allen Ma, Jackson Nyman, John Shamshoian, Guillaume Chhor, Darpan Sanghavi, Marc Thibault, Limin Yu, Fedaa Najdawi, et al. (8 additional authors not shown)

Abstract: Pathology is the study of microscopic inspection of tissue, and a pathology diagnosis is often the medical gold standard to diagnose disease. Pathology images provide a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology FM that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales that enable a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks which span pathology scales ranging from subcellular to slide-scale, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes when compared to PLUTO.
Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2403.00892 [pdf, other]

PowerFlowMultiNet: Multigraph Neural Networks for Unbalanced Three-Phase Distribution Systems

Authors: Salah Ghamizi, Jun Cao, Aoxiang Ma, Pedro Rodriguez

Abstract: Efficiently solving unbalanced three-phase power flow in distribution grids is pivotal for grid analysis and simulation. There is a pressing need for scalable algorithms capable of handling large-scale unbalanced power grids that can provide accurate and fast solutions. To address this, deep learning techniques, especially Graph Neural Networks (GNNs), have emerged. However, existing literature primarily focuses on balanced networks, leaving a critical gap in supporting unbalanced three-phase power grids. This letter introduces PowerFlowMultiNet, a novel multigraph GNN framework explicitly designed for unbalanced three-phase power grids. The proposed approach models each phase separately in a multigraph representation, effectively capturing the inherent asymmetry in unbalanced grids. A graph embedding mechanism utilizing message passing is introduced to capture spatial dependencies within the power system network. PowerFlowMultiNet outperforms traditional methods and other deep learning approaches in terms of accuracy and computational speed. Rigorous testing reveals significantly lower error rates and a notable hundredfold increase in computational speed for large power networks compared to model-based methods.

Submitted 6 September, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

arXiv:2306.04730 [pdf, other]

Stochastic Natural Thresholding Algorithms

Authors: Rachel Grotheer, Shuang Li, Anna Ma, Deanna Needell, Jing Qin

Abstract: Sparse signal recovery is one of the most fundamental problems in various applications, including medical imaging and remote sensing. Many greedy algorithms based on the family of hard thresholding operators have been developed to solve the sparse signal recovery problem. More recently, Natural Thresholding (NT) has been proposed with improved computational efficiency. This paper proposes and discusses convergence guarantees for stochastic natural thresholding algorithms by extending the NT from the deterministic version with linear measurements to the stochastic version with a general objective function. We also conduct various numerical experiments on linear and nonlinear measurements to demonstrate the performance of StoNT.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2208.09096 [pdf, other]

Representation Learning for the Automatic Indexing of Sound Effects Libraries

Authors: Alison B. Ma, Alexander Lerch

Abstract: Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overcome dataset-dependent limitations that inhibit the successful training of deep learning models, we pursue representation learning to train generalized embeddings that can be used for a wide variety of sound effects libraries and are a taxonomy-agnostic representation of sound. We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size, outperforming established representations such as OpenL3. Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.

Submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted at the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), 10 pages, 7 figures

arXiv:2207.08998 [pdf]

Discovering novel systemic biomarkers in photos of the external eye

Authors: Boris Babenko, Ilana Traynis, Christina Chen, Preeti Singh, Akib Uddin, Jorge Cuadros, Lauren P. Daskivich, April Y. Maa, Ramasamy Kim, Eugene Yu-Chuan Kang, Yossi Matias, Greg S. Corrado, Lily Peng, Dale R. Webster, Christopher Semturs, Jonathan Krause, Avinash V. Varadarajan, Naama Hammel, Yun Liu

Abstract: External eye photos were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate if external eye photos contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photos as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidney (eGFR estimated using the race-free 2021 CKD-EPI creatinine equation, the urine ACR); bone & mineral (calcium); thyroid (TSH); and blood count (Hgb, WBC, platelets). Development leveraged 151,237 images from 49,015 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA. Evaluation focused on 9 pre-specified systemic parameters and leveraged 3 validation sets (A, B, C) spanning 28,869 patients with and without diabetes undergoing eye screening in 3 independent sites in Los Angeles County, CA, and the greater Atlanta area, GA. We compared against baseline models incorporating available clinicodemographic variables (e.g. age, sex, race/ethnicity, years with diabetes). Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST>36, calcium<8.6, eGFR<60, Hgb<11, platelets<150, ACR>=300, and WBC<4 on validation set A (a patient population similar to the development sets), where the AUC of DLS exceeded that of the baseline by 5.2-19.4%.
On validation sets B and C, with substantial patient population differences compared to the development sets, the DLS outperformed the baseline for ACR>=300 and Hgb<11 by 7.3-13.2%. Our findings provide further evidence that external eye photos contain important biomarkers of systemic health spanning multiple organ systems. Further work is needed to investigate whether and how these biomarkers can be translated into clinical impact.

Submitted 18 July, 2022; originally announced July 2022.

arXiv:2206.09008 [pdf, ps, other]

doi 10.1002/cta.3488

Orthogonal Rational Approximation of Transfer Functions for High-Frequency Circuits

Authors: Andrew Ma, Arif Ege Engin

Abstract: Rational function approximations find applications in many areas including macro-modeling of high-frequency circuits, model order reduction for controller design, interpolation and extrapolation of system responses, surrogate models for high-energy physics, and approximation of elementary mathematical functions. The unknown denominator polynomial of the model results in a non-linear problem, which can be replaced with successive solutions of linearized problems following the Sanathanan-Koerner (SK) iteration. An orthogonal basis can be obtained based on Arnoldi resulting in a stabilized SK iteration. We present an extension of the stabilized SK, called Orthogonal Rational Approximation (ORA), which ensures real polynomial coefficients and stable poles for realizability of electrical networks. We also introduce an efficient implementation of ORA for multi-port networks based on a block QR decomposition.

Submitted 17 June, 2022; originally announced June 2022.

Journal ref: Int J Circ Theor Appl. 2022; 1-13

arXiv:2110.13670 [pdf, other]

W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image

Authors: Anyu Mao, Jialun Wu, Xinrui Bao, Zeyu Gao, Tieliang Gong, Chen Li

Abstract: Pathological diagnosis is the gold standard for cancer diagnosis, but it is labor-intensive, in which tasks such as cell detection, classification, and counting are particularly prominent. A common solution for automating these tasks is using nucleus segmentation technology. However, it is hard to train a robust nucleus segmentation model, due to several challenging problems, the nucleus adhesion, stacking, and excessive fusion with the background. Recently, some researchers proposed a series of automatic nucleus segmentation methods based on point annotation, which can significant improve the model performance. Nevertheless, the point annotation needs to be marked by experienced pathologists. In order to take advantage of segmentation methods based on point annotation, further alleviate the manual workload, and make cancer diagnosis more efficient and accurate, it is necessary to develop an automatic nucleus detection algorithm, which can automatically and efficiently locate the position of the nucleus in the pathological image and extract valuable information for pathologists. In this paper, we propose a W-shaped network for automatic nucleus detection. Different from the traditional U-Net based method, mapping the original pathology image to the target mask directly, our proposed method split the detection task into two sub-tasks. The first sub-task maps the original pathology image to the binary mask, then the binary mask is mapped to the density mask in the second sub-task.
After the task is split, the task's difficulty is significantly reduced, and the network's overall performance is improved.

Submitted 26 October, 2021; originally announced October 2021.

Comments: Accepted at BIBM 2021; 8 pages, 3 figures

arXiv:2011.11732 [pdf]

doi 10.1038/s41551-022-00867-5

Detecting hidden signs of diabetes in external eye photographs

Authors: Boris Babenko, Akinori Mitani, Ilana Traynis, Naho Kitade, Preeti Singh, April Maa, Jorge Cuadros, Greg S. Corrado, Lily Peng, Dale R. Webster, Avinash Varadarajan, Naama Hammel, Yun Liu

Abstract: Diabetes-related retinal conditions can be detected by examining the posterior of the eye. By contrast, examining the anterior of the eye can reveal conditions affecting the front of the eye, such as changes to the eyelids, cornea, or crystalline lens. In this work, we studied whether external photographs of the front of the eye can reveal insights into both diabetic retinal diseases and blood glucose control. We developed a deep learning system (DLS) using external eye photographs of 145,832 patients with diabetes from 301 diabetic retinopathy (DR) screening sites in one US state, and evaluated the DLS on three validation sets containing images from 198 sites in 18 other US states. In validation set A (n=27,415 patients, all undilated), the DLS detected poor blood glucose control (HbA1c > 9%) with an area under receiver operating characteristic curve (AUC) of 70.2; moderate-or-worse DR with an AUC of 75.3; diabetic macular edema with an AUC of 78.0; and vision-threatening DR with an AUC of 79.4. For all 4 prediction tasks, the DLS's AUC was higher (p<0.001) than using available self-reported baseline characteristics (age, sex, race/ethnicity, years with diabetes). In terms of positive predictive value, the predicted top 5% of patients had a 67% chance of having HbA1c > 9%, and a 20% chance of having vision threatening diabetic retinopathy. The results generalized to dilated pupils (validation set B, 5,058 patients) and to a different screening service (validation set C, 10,402 patients).
Our results indicate that external eye photographs contain information useful for healthcare providers managing patients with diabetes, and may help prioritize patients for in-person screening. Further work is needed to validate these findings on different devices and patient populations (those without diabetes) to evaluate its utility for remote diagnosis and management.

Submitted 23 November, 2020; originally announced November 2020.

Journal ref: Nature Biomedical Engineering 2022

arXiv:2011.09766 [pdf, other]

Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery

Authors: Zhuo Zheng, Yanfei Zhong, Junjue Wang, Ailong Ma

Abstract: Geospatial object segmentation, as a particular semantic segmentation task, always faces with larger-scale variation, larger intra-class variance of background, and foreground-background imbalance in the high spatial resolution (HSR) remote sensing imagery. However, general semantic segmentation methods mainly focus on scale variation in the natural scene, with inadequate consideration of the other two problems that usually happen in the large area earth observation scene. In this paper, we argue that the problems lie on the lack of foreground modeling and propose a foreground-aware relation network (FarSeg) from the perspectives of relation-based and optimization-based foreground modeling, to alleviate the above two problems. From perspective of relation, FarSeg enhances the discrimination of foreground features via foreground-correlated contexts associated by learning foreground-scene relation. Meanwhile, from perspective of optimization, a foreground-aware optimization is proposed to focus on foreground examples and hard examples of background during training for a balanced optimization. The experimental results obtained using a large scale dataset suggest that the proposed method is superior to the state-of-the-art general semantic segmentation methods and achieves a better trade-off between speed and accuracy. Code has been made available at: https://github.com/Z-Zheng/FarSeg.

Submitted 19 November, 2020; originally announced November 2020.

Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

arXiv:2011.05670 [pdf, other]

doi 10.1109/TGRS.2020.2967821

FPGA: Fast Patch-Free Global Learning Framework for Fully End-to-End Hyperspectral Image Classification

Authors: Zhuo Zheng, Yanfei Zhong, Ailong Ma, Liangpei Zhang

Abstract: Deep learning techniques have provided significant improvements in hyperspectral image (HSI) classification. The current deep learning based HSI classifiers follow a patch-based learning framework by dividing the image into overlapping patches. As such, these methods are local learning methods, which have a high computational cost. In this paper, a fast patch-free global learning (FPGA) framework is proposed for HSI classification. In FPGA, an encoder-decoder based FCN is utilized to consider the global spatial information by processing the whole image, which results in fast inference. However, it is difficult to directly utilize the encoder-decoder based FCN for HSI classification as it always fails to converge due to the insufficiently diverse gradients caused by the limited training samples. To solve the divergence problem and maintain the abilities of FCN of fast inference and global spatial information mining, a global stochastic stratified sampling strategy is first proposed by transforming all the training samples into a stochastic sequence of stratified samples. This strategy can obtain diverse gradients to guarantee the convergence of the FCN in the FPGA framework. For a better design of FCN architecture, FreeNet, which is a fully end-to-end network for HSI classification, is proposed to maximize the exploitation of the global spatial information and boost the performance via a spectral attention based encoder and a lightweight decoder.
A lateral connection module is also designed to connect the encoder and decoder, fusing the spatial details in the encoder and the semantic features in the decoder. The experimental results obtained using three public benchmark datasets suggest that the FPGA framework is superior to the patch-based framework in both speed and accuracy for HSI classification. Code has been made available at: https://github.com/Z-Zheng/FreeNet. △ Less

Submitted 11 November, 2020; originally announced November 2020.

Comments: 16 pages, 15 figures, IEEE Transactions on Geoscience and Remote Sensing, 2020

arXiv:2011.03247 [pdf, other]

Hi-UCD: A Large-scale Dataset for Urban Semantic Change Detection in Remote Sensing Imagery

Authors: Shiqi Tian, Ailong Ma, Zhuo Zheng, Yanfei Zhong

Abstract: With the acceleration of the urban expansion, urban change detection (UCD), as a significant and effective approach, can provide the change information with respect to geospatial objects for dynamical urban analysis. However, existing datasets suffer from three bottlenecks: (1) lack of high spatial resolution images; (2) lack of semantic annotation; (3) lack of long-range multi-temporal images. In this paper, we propose a large scale benchmark dataset, termed Hi-UCD. This dataset uses aerial images with a spatial resolution of 0.1 m provided by the Estonia Land Board, including three-time phases, and semantically annotated with nine classes of land cover to obtain the direction of ground objects change. It can be used for detecting and analyzing refined urban changes. We benchmark our dataset using some classic methods in binary and multi-class change detection. Experimental results show that Hi-UCD is challenging yet useful. We hope the Hi-UCD can become a strong benchmark accelerating future research.

Submitted 27 December, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: Presented at NeurIPS 2020 Workshop on Machine Learning for the Developing World

arXiv:1801.10264 [pdf, other]

Compressed Anomaly Detection with Multiple Mixed Observations

Authors: Natalie Durgin, Rachel Grotheer, Chenxi Huang, Shuang Li, Anna Ma, Deanna Needell, Jing Qin

Abstract: We consider a collection of independent random variables that are identically distributed, except for a small subset which follows a different, anomalous distribution. We study the problem of detecting which random variables in the collection are governed by the anomalous distribution. Recent work proposes to solve this problem by conducting hypothesis tests based on mixed observations (e.g. linear combinations) of the random variables. Recognizing the connection between taking mixed observations and compressed sensing, we view the problem as recovering the "support" (index set) of the anomalous random variables from multiple measurement vectors (MMVs). Many algorithms have been developed for recovering jointly sparse signals and their support from MMVs. We establish the theoretical and empirical effectiveness of these algorithms at detecting anomalies. We also extend the LASSO algorithm to an MMV version for our purpose. Further, we perform experiments on synthetic data, consisting of samples from the random variables, to explore the trade-off between the number of mixed observations per sample and the number of samples required to detect anomalies.

Submitted 19 June, 2018; v1 submitted 30 January, 2018; originally announced January 2018.

Comments: 27 pages, 9 figures. Incorporates reviewer feedback, additional experiments, and additional figures

arXiv:1711.02743 [pdf, other]

Sparse Randomized Kaczmarz for Support Recovery of Jointly Sparse Corrupted Multiple Measurement Vectors

Authors: Natalie Durgin, Rachel Grotheer, Chenxi Huang, Shuang Li, Anna Ma, Deanna Needell, Jing Qin

Abstract: While single measurement vector (SMV) models have been widely studied in signal processing, there is a surging interest in addressing the multiple measurement vectors (MMV) problem. In the MMV setting, more than one measurement vector is available and the multiple signals to be recovered share some commonalities such as a common support. Applications in which MMV is a naturally occurring phenomenon include online streaming, medical imaging, and video recovery. This work presents a stochastic iterative algorithm for the support recovery of jointly sparse corrupted MMV. We present a variant of the Sparse Randomized Kaczmarz algorithm for corrupted MMV and compare our proposed method with an existing Kaczmarz type algorithm for MMV problems. We also showcase the usefulness of our approach in the online (streaming) setting and provide empirical evidence that suggests the robustness of the proposed method to the distribution of the corruption and the number of corruptions occurring.

Submitted 14 June, 2018; v1 submitted 7 November, 2017; originally announced November 2017.

Comments: 13 pages, 6 figures

Showing 1–19 of 19 results for author: Maa, A