-
Principal Component Analysis based frameworks for efficient missing data imputation algorithms
Authors:
Thu Nguyen,
Hoang Thien Ly,
Michael Alexander Riegler,
Pål Halvorsen,
Hugo L. Hammer
Abstract:
Missing data is a commonly occurring problem in practice. Many imputation methods have been developed to fill in the missing entries. However, not all of them can scale to high-dimensional data, especially the multiple imputation techniques. Meanwhile, data nowadays tend to be high-dimensional. Therefore, in this work, we propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) to speed up the imputation process and alleviate memory issues of many available imputation techniques, without sacrificing the imputation quality in terms of MSE. In addition, the framework can be used even when some or all of the missing features are categorical, or when the number of missing features is large. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI for classification problems with some adjustments. We validate our approach by experiments on various scenarios, which show that PCAI and PIC can work with various imputation algorithms, including the state-of-the-art ones, and improve the imputation speed significantly, while achieving competitive mean square error/classification accuracy compared to direct imputation (i.e., imputing directly on the missing data).
Submitted 19 March, 2023; v1 submitted 30 May, 2022;
originally announced May 2022.
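The idea in the abstract above can be illustrated with a minimal sketch. It assumes, as one plausible reading of the abstract, that the fully observed features are compressed with PCA and that imputation is then run on the compressed features together with the columns that contain missing values. The function names (`pca_reduce`, `regression_impute`, `pcai_sketch`) are hypothetical, and the least-squares imputer is only a stand-in for whichever imputation algorithm would be plugged in; none of this is taken from the paper's code.

```python
import numpy as np

def pca_reduce(F, n_components):
    """Project the fully observed block F onto its top principal components."""
    Fc = F - F.mean(axis=0)                      # center each column
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:n_components].T              # scores, shape (n, n_components)

def regression_impute(R, M):
    """Stand-in imputer (an assumption, not the paper's choice):
    fill each column of M by least-squares regression on R."""
    M = M.copy()
    X = np.column_stack([np.ones(len(R)), R])    # add an intercept column
    for j in range(M.shape[1]):
        miss = np.isnan(M[:, j])
        if miss.any() and (~miss).any():
            coef, *_ = np.linalg.lstsq(X[~miss], M[~miss, j], rcond=None)
            M[miss, j] = X[miss] @ coef
    return M

def pcai_sketch(X, n_components):
    """Reduce the fully observed columns, then impute the remaining columns."""
    has_nan = np.isnan(X).any(axis=0)
    F, M = X[:, ~has_nan], X[:, has_nan]         # observed block, missing block
    R = pca_reduce(F, n_components)
    X_out = X.copy()
    X_out[:, has_nan] = regression_impute(R, M)
    return X_out
```

The imputer only ever sees `n_components` predictors instead of every fully observed column, which is where a speed and memory gain of the kind described in the abstract would come from.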
-
The Complex-Pole Filter Representation (COFRE) for spectral modeling of fNIRS signals
Authors:
Marco A. Pinto Orellana,
Peyman Mirtaheri,
Hugo L. Hammer
Abstract:
The complex-pole frequency representation (COFRE) is introduced in this paper as a new approach for spectrum modeling in biomedical signals. Our method allows us to estimate the spectral power density at precise frequencies using an array of narrow band-pass filters with single complex poles. Closed-form expressions for the frequency resolution and transient time response of the proposed filters have also been formulated. In addition, COFRE filters have a constant time and space complexity allowing their use in real-time environments. Our model was applied to identify frequency markers that characterize tinnitus in very-low-frequency oscillations within functional near-infrared spectroscopy (fNIRS) signals. We examined data from six patients with subjective tinnitus and seven healthy participants as a control group. A significant decrease in the spectrum power was observed in tinnitus patients in the left temporal lobe. In particular, we identified several tinnitus signatures in the spectral hemodynamic information, including (a) a significant spectrum difference in one specific harmonic in the metabolic/endothelial frequency region, at 7 mHz, for both chromophores and hemispheres; and (b) significant differences in the range 30-50 mHz in the neurogenic/myogenic band.
Submitted 13 May, 2021;
originally announced May 2021.
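A bank of single complex-pole band-pass filters, as mentioned in the abstract above, can be sketched as follows. This is a generic one-pole complex resonator, not the paper's parameterization: the class name `ComplexPoleFilter`, the pole radius `r`, and the `(1 - r)` normalization are assumptions made for illustration.

```python
import cmath
import math

class ComplexPoleFilter:
    """One-pole complex resonator tuned to f0: y[n] = p*y[n-1] + (1-r)*x[n]."""

    def __init__(self, f0, fs, r=0.99):
        # Pole at radius r and angle 2*pi*f0/fs; r close to 1 gives a narrow band.
        self.pole = r * cmath.exp(2j * math.pi * f0 / fs)
        self.gain = 1.0 - r          # makes the gain at f0 equal to one
        self.state = 0j

    def update(self, x):
        """Consume one sample and return the current power estimate at f0."""
        self.state = self.pole * self.state + self.gain * x
        return abs(self.state) ** 2

def band_powers(signal, freqs, fs, r=0.99):
    """Run one filter per frequency and return the final power estimates."""
    filters = [ComplexPoleFilter(f, fs, r) for f in freqs]
    powers = [0.0] * len(freqs)
    for x in signal:
        for i, flt in enumerate(filters):
            powers[i] = flt.update(x)
    return powers
```

Each filter keeps a single complex state and does a fixed amount of work per sample, which matches the constant time and space complexity the abstract highlights. For a unit-amplitude real cosine at `f0`, the tuned filter settles near a power of 0.25, because only the positive-frequency half of the cosine falls inside the pass band.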
-
Dyadic aggregated autoregressive (DASAR) model for time-frequency representation of biomedical signals
Authors:
Marco A. Pinto-Orellana,
Habib Sherkat,
Peyman Mirtaheri,
Hugo L. Hammer
Abstract:
This paper introduces a new time-frequency representation method for biomedical signals: the dyadic aggregated autoregressive (DASAR) model. Signals, such as electroencephalograms (EEGs) and functional near-infrared spectroscopy (fNIRS), exhibit physiological information through time-evolving spectrum components at specific frequency intervals: 0-50 Hz (EEG) or 0-150 mHz (fNIRS). Spectrotemporal features in signals are conventionally estimated using short-time Fourier transform (STFT) and wavelet transform (WT). However, both methods may not offer the most robust or compact representation despite their widespread use in biomedical contexts. The presented method, DASAR, improves precise frequency identification and tracking of interpretable frequency components with a parsimonious set of parameters. DASAR achieves these characteristics by assuming that the biomedical time-varying spectrum comprises several independent stochastic oscillators with (piecewise) time-varying frequencies. Local stationarity can be assumed within dyadic subdivisions of the recordings, while the stochastic oscillators can be modeled with an aggregation of second-order autoregressive models (ASAR). DASAR can provide a more accurate representation of the (highly contrasted) EEG and fNIRS frequency ranges by increasing the estimation accuracy in a user-defined spectrum region of interest (SROI). A mental arithmetic experiment on a hybrid EEG-fNIRS was conducted to assess the efficiency of the method. Our proposed technique, STFT, and WT were applied to both biomedical signals to discover potential oscillators that improve the discrimination between the task condition and its baseline. The results show that DASAR provided the highest spectrum differentiation and it was the only method that could identify Mayer waves as narrow-band artifacts at 97.4-97.5 mHz.
Submitted 13 May, 2021;
originally announced May 2021.
-
SCAU: Modeling spectral causality for multivariate time series with applications to electroencephalograms
Authors:
Marco Antonio Pinto-Orellana,
Peyman Mirtaheri,
Hugo L. Hammer,
Hernando Ombao
Abstract:
Electroencephalograms (EEG) are noninvasive measurement signals of electrical neuronal activity in the brain. One of the current major statistical challenges is formally measuring functional dependency between those complex signals. This paper proposes the spectral causality model (SCAU), a robust linear model, under a causality paradigm, to reflect inter- and intra-frequency modulation effects that cannot be identified using other methods. SCAU inference is conducted with three main steps: (a) signal decomposition into frequency bins, (b) intermediate spectral band mapping, and (c) dependency modeling through frequency-specific vector autoregressive (VAR) models. We apply SCAU to study complex dependencies during visual and lexical fluency tasks (word generation and visual fixation) in 26 participants' EEGs. We compared the connectivity networks estimated using SCAU with those of a VAR model. SCAU networks show a clear contrast for both stimuli, while the link magnitudes also showed a low variance in comparison with the VAR networks. Furthermore, SCAU dependency connections were not only consistent with findings in the neuroscience literature, but also provided further evidence on the directionality of the spatio-spectral dependencies such as the delta-originated and theta-induced links in the fronto-temporal brain network.
Submitted 13 May, 2021;
originally announced May 2021.
-
An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification
Authors:
Vajira Thambawita,
Debesh Jha,
Hugo Lewi Hammer,
Håvard D. Johansen,
Dag Johansen,
Pål Halvorsen,
Michael A. Riegler
Abstract:
Precise and efficient automated identification of Gastrointestinal (GI) tract diseases can help doctors treat more patients and improve the rate of disease detection and identification. Currently, automatic analysis of diseases in the GI tract is a hot topic in both computer science and medical-related journals. Nevertheless, the evaluation of such an automatic analysis is often incomplete or simply wrong. Algorithms are often only tested on small and biased datasets, and cross-dataset evaluations are rarely performed. A clear understanding of evaluation metrics and machine learning models with cross datasets is crucial to bring research in the field to a new quality level. Towards this goal, we present comprehensive evaluations of five distinct machine learning models using Global Features and Deep Neural Networks that can classify 16 different key types of GI tract conditions, including pathological findings, anatomical landmarks, polyp removal conditions, and normal findings from images captured by common GI tract examination instruments. In our evaluation, we introduce performance hexagons using six performance metrics such as recall, precision, specificity, accuracy, F1-score, and Matthews Correlation Coefficient to demonstrate how to determine the real capabilities of models rather than evaluating them shallowly. Furthermore, we perform cross-dataset evaluations using different datasets for training and testing. With these cross-dataset evaluations, we demonstrate the challenge of actually building a generalizable model that could be used across different hospitals. Our experiments clearly show that more sophisticated performance metrics and evaluation methods need to be applied to get reliable models rather than depending on evaluations of the splits of the same dataset, i.e., the performance metrics should always be interpreted together rather than relying on a single metric.
Submitted 8 May, 2020;
originally announced May 2020.
-
Efficient Quantile Tracking Using an Oracle
Authors:
Hugo L. Hammer,
Anis Yazidi,
Michael A. Riegler,
Håvard Rue
Abstract:
For incremental quantile estimators the step size and possibly other tuning parameters must be carefully set. However, little attention has been given to how to set these values in an online manner. In this article we suggest two novel procedures that address this issue.
The core part of the procedures is to estimate the current tracking mean squared error (MSE). The MSE is decomposed into tracking variance and bias, and novel and efficient procedures to estimate these quantities are presented. It is shown that estimation bias can be tracked by associating it with the portion of observations below the quantile estimates.
The first procedure runs an ensemble of $L$ quantile estimators for a wide range of values of the tuning parameters, typically around $L = 100$. In each iteration an oracle selects the best estimate guided by the estimated MSEs. The second method only runs an ensemble of $L = 3$ estimators, and thus the values of the tuning parameters of the running estimators need to be adjusted from time to time. The procedures have a low memory footprint of $8L$ and a computational complexity of $8L$ per iteration.
The experiments show that the procedures are highly efficient and track quantiles with an error close to the theoretical optimum. The Oracle approach performs best, but comes with higher computational cost. The procedures were further applied to a massive real-life data stream of tweets, which demonstrated their real-world applicability.
Submitted 27 April, 2020;
originally announced April 2020.
-
Estimating Tukey Depth Using Incremental Quantile Estimators
Authors:
Hugo Lewi Hammer,
Anis Yazidi,
Håvard Rue
Abstract:
The concept of depth represents methods to measure how deep an arbitrary point is positioned in a dataset and can be seen as the opposite of outlyingness. It has proved very useful and a wide range of methods have been developed based on the concept.
To address the well-known computational challenges associated with the depth concept, we suggest estimating Tukey depth contours using recently developed incremental quantile estimators. The suggested algorithm can estimate depth contours when the dataset is known in advance, but also recursively update and even track Tukey depth contours for dynamically varying data stream distributions. Tracking was demonstrated in a real-life data example where changes in human activity were detected in real-time from accelerometer observations.
Submitted 8 January, 2020;
originally announced January 2020.
-
Machine Learning-Based Analysis of Sperm Videos and Participant Data for Male Fertility Prediction
Authors:
Steven A. Hicks,
Jorunn M. Andersen,
Oliwia Witczak,
Vajira Thambawita,
Pål Halvorsen,
Hugo L. Hammer,
Trine B. Haugen,
Michael A. Riegler
Abstract:
Methods for automatic analysis of clinical data are usually targeted towards a specific modality and do not make use of all relevant data available. In the field of male human reproduction, clinical and biological data are not used to their fullest potential. Manual evaluation of a semen sample using a microscope is time-consuming and requires extensive training. Furthermore, the validity of manual semen analysis has been questioned due to limited reproducibility, and often high inter-personnel variation. The existing computer-aided sperm analyzer systems are not recommended for routine clinical use due to methodological challenges caused by the consistency of the semen sample. Thus, there is a need for an improved methodology. We use modern and classical machine learning techniques together with a dataset consisting of 85 videos of human semen samples and related participant data to automatically predict sperm motility. The techniques used include simple linear regression and more sophisticated methods using convolutional neural networks. Our results indicate that sperm motility prediction based on deep learning using sperm motility videos is rapid to perform and consistent. The algorithms performed worse when participant data was added. In conclusion, machine learning-based automatic analysis may become a valuable tool in male infertility investigation and research.
Submitted 29 October, 2019;
originally announced October 2019.
-
Joint Tracking of Multiple Quantiles Through Conditional Quantiles
Authors:
Hugo Lewi Hammer,
Anis Yazidi,
Håvard Rue
Abstract:
Estimation of quantiles is one of the most fundamental real-time analysis tasks. Most real-time data streams vary dynamically with time, and incremental quantile estimators show state-of-the-art performance in tracking quantiles of such data streams. However, most are not able to make joint estimates of multiple quantiles in a consistent manner, and estimates may violate the monotone property of quantiles. In this paper we propose the general concept of *conditional quantiles* that can extend incremental estimators to jointly track multiple quantiles. We apply the concept to propose two new estimators. Extensive experimental results, on both synthetic and real-life data, show that the new estimators clearly outperform legacy state-of-the-art joint quantile tracking algorithms and achieve faster adaptivity in dynamically varying data streams.
Submitted 13 February, 2019;
originally announced February 2019.
-
Statistical models for short and long term forecasts of snow depth
Authors:
Hugo Lewi Hammer
Abstract:
Forecasting of future snow depths is useful for many applications like road safety, winter sport activities, avalanche risk assessment and hydrology. Motivated by the lack of statistical forecast models for snow depth, in this paper we present a set of models to fill this gap. First, we present a model to do short-term forecasts when we assume that reliable weather forecasts of air temperature and precipitation are available. The covariates are included nonlinearly into the model following basic physical principles of snowfall, snow aging and melting. Due to the large set of observations with snow depth equal to zero, we use a zero-inflated gamma regression model, which is commonly used for similar applications like precipitation. We also make long-term forecasts of snow depth, extending much further than traditional weather forecasts for temperature and precipitation. The long-term forecasts are based on fitting models to historic time series of precipitation, temperature and snow depth. We fit the models to data from three locations in Norway with different climatic properties. Forecasting five days into the future, the results showed that, given reliable weather forecasts of temperature and precipitation, the forecast errors in absolute value were between 3 and 7 cm for different locations in Norway. Forecasting three weeks into the future, the forecast errors were between 7 and 16 cm.
Submitted 15 January, 2019;
originally announced January 2019.
-
Quantile Tracking in Dynamically Varying Data Streams Using a Generalized Exponentially Weighted Average of Observations
Authors:
Hugo Lewi Hammer,
Anis Yazidi,
Håvard Rue
Abstract:
The Exponentially Weighted Average (EWA) of observations is known to be a state-of-the-art estimator for tracking expectations of dynamically varying data stream distributions. However, how to devise an EWA estimator that instead tracks quantiles of data stream distributions is not obvious. In this paper, we present a lightweight quantile estimator using a generalized form of the EWA. To the best of our knowledge, this work represents the first reported quantile estimator of this form in the literature. An appealing property of the estimator is that the update step size is adjusted online proportionally to the difference between the current observation and the current quantile estimate. Thus, if the estimator is off-track compared to the data stream, large steps will be taken to promptly get the estimator back on-track. The convergence of the estimator to the true quantile is proven using the theory of stochastic learning.
Extensive experimental results using both synthetic and real-life data show that our estimator clearly outperforms legacy state-of-the-art quantile tracking estimators and achieves faster adaptivity in dynamic environments. The quantile estimator was further tested on real-life data where the objective is efficient online control of indoor climate. We show that the estimator can be incorporated into a concept drift detector to efficiently decide when a machine learning model used to predict future indoor temperature should be retrained/updated.
Submitted 15 January, 2019;
originally announced January 2019.
-
Parameter Estimation in Abruptly Changing Dynamic Environments
Authors:
Hugo Lewi Hammer,
Anis Yazidi
Abstract:
Many real-life dynamical systems change abruptly followed by almost stationary periods. In this paper, we consider streams of data with such abrupt behavior and investigate the problem of tracking their statistical properties in an online manner.
We devise a tracking procedure where an estimator that is suitable for a stationary environment is combined with an event detection method such that the estimator can rapidly jump to a more suitable value if an event is detected. Combining an estimation procedure with a detection procedure is a commonly known idea in the literature. However, our contribution lies in building the detection procedure based on the difference between the stationary estimator and a Stochastic Learning Weak Estimator (SLWE). The SLWE estimator is known to be the state-of-the-art approach to tracking properties of non-stationary environments and thus should be a better choice to detect changes in abruptly changing environments than the far more common sliding window based approaches. To the best of our knowledge, the event detection procedure suggested by Ross et al. (2012) is the only procedure in the literature taking advantage of the powerful tracking properties of the SLWE estimator. The procedure in Ross et al. is however quite complex and not well founded theoretically compared to the procedures in this paper. In this paper, we focus on estimation procedures for the binomial and multinomial distributions, but our approach can be easily generalized to cover other distributions as well.
Extensive simulation results based on both synthetic and real-life data related to news classification demonstrate that our estimation procedure is easy to tune and performs well.
Submitted 15 January, 2019;
originally announced January 2019.
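The combination described in the abstract above, a stationary estimator that jumps when it disagrees with an SLWE, can be sketched for a binary stream as follows. The SLWE is written in its exponential-moving-average form. The class name `AbruptChangeTracker`, the fixed `threshold`, and the reset rule are hypothetical choices made for illustration; the abstract does not state the paper's actual detection rule.

```python
class AbruptChangeTracker:
    """Running mean for stationary periods, reset when it drifts from an SLWE."""

    def __init__(self, lam=0.95, threshold=0.25):
        self.lam = lam              # SLWE learning parameter, 0 < lam < 1
        self.threshold = threshold  # hypothetical detection threshold
        self.weak = 0.0             # SLWE estimate of P(x = 1)
        self.estimate = 0.0         # stationary (running-mean) estimate
        self.n = 0                  # samples since the last reset
        self.events = 0             # number of detected events

    def update(self, x):
        """Consume one binary observation x in {0, 1}; return the estimate."""
        if self.n == 0:
            # The first observation initializes both estimators.
            self.weak = float(x)
            self.estimate = float(x)
            self.n = 1
            return self.estimate
        # SLWE update for the binomial case.
        self.weak = self.lam * self.weak + (1.0 - self.lam) * x
        # Running mean over the current stationary segment.
        self.n += 1
        self.estimate += (x - self.estimate) / self.n
        # Event: the two estimators disagree by more than the threshold.
        if abs(self.estimate - self.weak) > self.threshold:
            self.estimate = self.weak   # jump to the weak estimator's value
            self.n = 1                  # start a new stationary segment
            self.events += 1
        return self.estimate
```

Right after a change this simple rule can fire several times in a row, because the restarted running mean moves faster than the SLWE. The estimate still settles on the new level once the SLWE has caught up.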
-
The Medico-Task 2018: Disease Detection in the Gastrointestinal Tract using Global Features and Deep Learning
Authors:
Vajira Thambawita,
Debesh Jha,
Michael Riegler,
Pål Halvorsen,
Hugo Lewi Hammer,
Håvard D. Johansen,
Dag Johansen
Abstract:
In this paper, we present our approach for the 2018 Medico Task classifying diseases in the gastrointestinal tract. We have proposed a system based on global features and deep neural networks. The best approach combines two neural networks, and the reproducible experimental results signify the efficiency of the proposed model with an accuracy rate of 95.80%, a precision of 95.87%, and an F1-score of 95.80%.
Submitted 31 October, 2018;
originally announced October 2018.
-
Estimation of Multiple Quantiles in Dynamically Varying Data Streams
Authors:
Hugo Lewi Hammer,
Anis Yazidi,
Håvard Rue
Abstract:
In this paper we consider the problem of estimating quantiles when data are received sequentially (data stream). For real-life data streams, the distribution of the data typically varies with time, making estimation of quantiles challenging. We present a method that simultaneously maintains estimates of multiple quantiles of the data stream distribution. The method is based on making incremental updates of the quantile estimates every time a new sample from the data stream is received. The method is memory and computationally efficient since it only stores one value for each quantile estimate and only performs one operation per quantile estimate when a new sample is received from the data stream. The estimates are realistic in the sense that the monotone property of quantiles is satisfied in every iteration. Experiments show that the method efficiently tracks multiple quantiles and outperforms state-of-the-art methods.
Submitted 31 January, 2017;
originally announced February 2017.
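To make "one stored value and one update per quantile" concrete, here is a minimal sketch of a generic incremental quantile tracker. It uses the classic stochastic-approximation update and then sorts the estimates to keep them monotone. Both are swapped-in textbook techniques, not the update rule of the paper above, and the class name `MultiQuantileTracker` and the fixed additive `step` are assumptions.

```python
import random

class MultiQuantileTracker:
    """Tracks several quantiles of a stream with one stored value per quantile."""

    def __init__(self, probs, step=0.005, init=0.0):
        self.probs = sorted(probs)        # target probabilities, ascending
        self.step = step                  # fixed step size, in data units
        self.estimates = [init] * len(self.probs)

    def update(self, x):
        """Consume one observation and return the current estimates."""
        for k, q in enumerate(self.probs):
            # Move up by step*q if x lies above the estimate,
            # down by step*(1 - q) otherwise.
            below = 1.0 if x <= self.estimates[k] else 0.0
            self.estimates[k] += self.step * (q - below)
        # Simple heuristic (an assumption) to keep the estimates monotone.
        self.estimates.sort()
        return self.estimates

# Usage: track the 10%, 50% and 90% quantiles of a uniform(0, 1) stream.
rng = random.Random(0)
tracker = MultiQuantileTracker([0.1, 0.5, 0.9])
for _ in range(5000):
    tracker.update(rng.random())
```

Because the step is additive, it has to be chosen relative to the scale of the data. A larger step adapts faster to distribution changes at the cost of noisier estimates.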