Search | arXiv e-print repository

arXiv:2505.19635 [pdf, ps, other]

When fractional quasi p-norms concentrate

Authors: Ivan Y. Tyukin, Bogdan Grechuk, Evgeny M. Mirkes, Alexander N. Gorban

Abstract: Concentration of distances in high dimension is an important factor for the development and design of stable and reliable data analysis algorithms. In this paper, we address the fundamental long-standing question about the concentration of distances in high dimension for fractional quasi $p$-norms, $p\in(0,1)$. The topic has been at the centre of various theoretical and empirical controversies. Here we, for the first time, identify conditions when fractional quasi $p$-norms concentrate and when they don't. We show that contrary to some earlier suggestions, for broad classes of distributions, fractional quasi $p$-norms admit exponential and uniform in $p$ concentration bounds. For these distributions, the results effectively rule out previously proposed approaches to alleviate concentration by "optimal" setting the values of $p$ in $(0,1)$. At the same time, we specify conditions and the corresponding families of distributions for which one can still control concentration rates by appropriate choices of $p$. We also show that in an arbitrarily small vicinity of a distribution from a large class of distributions for which uniform concentration occurs, there are uncountably many other distributions featuring anti-concentration properties. Importantly, this behavior enables devising relevant data encoding or representation schemes favouring or discouraging distance concentration. The results shed new light on this long-standing problem and resolve the tension around the topic in both theory and empirical evidence reported in the literature.

Submitted 26 May, 2025; originally announced May 2025.

MSC Class: 68T09; 62R07; 94A16

arXiv:2402.06563 [pdf]

doi 10.1109/BigData59044.2023.10386194

What is Hiding in Medicine's Dark Matter? Learning with Missing Data in Medical Practices

Authors: Neslihan Suzen, Evgeny M. Mirkes, Damian Roland, Jeremy Levesley, Alexander N. Gorban, Tim J. Coats

Abstract: Electronic patient records (EPRs) produce a wealth of data but contain significant missing information. Understanding and handling this missing data is an important part of clinical data analysis and if left unaddressed could result in bias in analysis and distortion in critical conclusions. Missing data may be linked to health care professional practice patterns and imputation of missing data can increase the validity of clinical decisions. This study focuses on statistical approaches for understanding and interpreting the missing data and machine learning based clinical data imputation using a single centre's paediatric emergency data and the data from UK's largest clinical audit for traumatic injury database (TARN). In the study of 56,961 data points related to initial vital signs and observations taken on children presenting to an Emergency Department, we have shown that missing data are likely to be non-random and how these are linked to health care professional practice patterns. We have then examined 79 TARN fields with missing values for 5,791 trauma cases. Singular Value Decomposition (SVD) and k-Nearest Neighbour (kNN) based missing data imputation methods are used and imputation results against the original dataset are compared and statistically tested. We have concluded that the 1NN imputer is the best imputation which indicates a usual pattern of clinical decision making: find the most similar patients and take their attributes as imputation.

Submitted 9 February, 2024; originally announced February 2024.

Comments: 8 pages

Journal ref: 2023 IEEE International Conference on Big Data (BigData), 4979-4986
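
The abstract above describes 1NN imputation as finding the most similar patient and taking that patient's attributes. The following is a minimal sketch of that idea, not the authors' implementation. The function name impute_1nn, the toy records array, and the choice of distance (mean squared difference over the features both rows have observed) are assumptions made for illustration only.

```python
import numpy as np

def impute_1nn(X):
    """Fill each missing entry with the value from the nearest donor row.

    Hypothetical helper: similarity is the mean squared difference over the
    features that both rows have observed (an assumption, not from the paper).
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    missing = np.isnan(X)
    n_rows = X.shape[0]
    for i in range(n_rows):
        for j in np.where(missing[i])[0]:
            best, best_d = None, np.inf
            for k in range(n_rows):
                # A donor must be a different row that has feature j observed.
                if k == i or missing[k, j]:
                    continue
                shared = ~missing[i] & ~missing[k]
                if not shared.any():
                    continue
                d = np.mean((X[i, shared] - X[k, shared]) ** 2)
                if d < best_d:
                    best, best_d = k, d
            if best is not None:
                out[i, j] = X[best, j]
    return out

# Toy data: the second record is missing its third attribute.
records = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [9.0, 8.0, 7.0],
])
filled = impute_1nn(records)
print(filled)
```

In this toy example the second record is closest to the first one, so it borrows the first record's third attribute. An entry with no eligible donor is left missing.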

arXiv:2402.00899 [pdf, other]

Weakly Supervised Learners for Correction of AI Errors with Provable Performance Guarantees

Authors: Ivan Y. Tyukin, Tatiana Tyukina, Daniel van Helden, Zedong Zheng, Evgeny M. Mirkes, Oliver J. Sutton, Qinghua Zhou, Alexander N. Gorban, Penelope Allison

Abstract: We present a new methodology for handling AI errors by introducing weakly supervised AI error correctors with a priori performance guarantees. These AI correctors are auxiliary maps whose role is to moderate the decisions of some previously constructed underlying classifier by either approving or rejecting its decisions. The rejection of a decision can be used as a signal to suggest abstaining from making a decision. A key technical focus of the work is in providing performance guarantees for these new AI correctors through bounds on the probabilities of incorrect decisions. These bounds are distribution agnostic and do not rely on assumptions on the data dimension. Our empirical example illustrates how the framework can be applied to improve the performance of an image classifier in a challenging real-world task where training data are scarce.

Submitted 13 February, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

MSC Class: 68T05; 68T37

arXiv:2208.13290 [pdf, other]

doi 10.3390/e25010033

Domain Adaptation Principal Component Analysis: base linear method for learning with out-of-distribution data

Authors: Evgeny M Mirkes, Jonathan Bac, Aziz Fouché, Sergey V. Stasenko, Andrei Zinovyev, Alexander N. Gorban

Abstract: Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points, and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing the single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications leading to reduced dataset representations, taking into account possible divergence between source and target domains.

Submitted 15 December, 2022; v1 submitted 28 August, 2022; originally announced August 2022.

Journal ref: Entropy, 25(1), 33, 2023

arXiv:2205.15696 [pdf]

An Informational Space Based Semantic Analysis for Scientific Texts

Authors: Neslihan Suzen, Alexander N. Gorban, Jeremy Levesley, Evgeny M. Mirkes

Abstract: One major problem in Natural Language Processing is the automatic analysis and representation of human language. Human language is ambiguous and deeper understanding of semantics and creating human-to-machine interaction have required an effort in creating the schemes for act of communication and building common-sense knowledge bases for the 'meaning' in texts. This paper introduces computational methods for semantic analysis and quantifying the meaning of short scientific texts. Computational methods extracting semantic features are used to analyse the relations between texts of messages and 'representations of situations' for a newly created large collection of scientific texts, Leicester Scientific Corpus. The representation of scientific-specific meaning is standardised by replacing the situation representations, rather than psychological properties, with the vectors of some attributes: a list of scientific subject categories that the text belongs to. First, this paper introduces 'Meaning Space' in which the informational representation of the meaning is extracted from the occurrence of the word in texts across the scientific categories, i.e., the meaning of a word is represented by a vector of Relative Information Gain about the subject categories. Then, the meaning space is statistically analysed for Leicester Scientific Dictionary-Core and we investigate 'Principal Components of the Meaning' to describe the adequate dimensions of the meaning. The research in this paper lays the foundation for the geometric representation of the meaning of texts.

Submitted 31 May, 2022; originally announced May 2022.

Comments: 19 pages. arXiv admin note: substantial text overlap with arXiv:2009.08859, arXiv:2004.13717

Journal ref: Computer Science & Information Technology, volume 12, number 08, pp. 81-99, 2022. CS & IT - CSCP 2022

arXiv:2203.16687 [pdf, other]

Quasi-orthogonality and intrinsic dimensions as measures of learning and generalisation

Authors: Qinghua Zhou, Alexander N. Gorban, Evgeny M. Mirkes, Jonathan Bac, Andrei Zinovyev, Ivan Y. Tyukin

Abstract: Finding best architectures of learning machines, such as deep neural networks, is a well-known technical and theoretical challenge. Recent work by Mellor et al (2021) showed that there may exist correlations between the accuracies of trained networks and the values of some easily computable measures defined on randomly initialised networks which may enable to search tens of thousands of neural architectures without training. Mellor et al used the Hamming distance evaluated over all ReLU neurons as such a measure. Motivated by these findings, in our work, we ask the question of the existence of other and perhaps more principled measures which could be used as determinants of success of a given neural architecture. In particular, we examine, if the dimensionality and quasi-orthogonality of neural networks' feature space could be correlated with the network's performance after training. We showed, using the setup as in Mellor et al, that dimensionality and quasi-orthogonality may jointly serve as network's performance discriminants. In addition to offering new opportunities to accelerate neural architecture search, our findings suggest important relationships between the networks' final performance and properties of their randomly initialised feature spaces: data dimension and quasi-orthogonality.

Submitted 30 March, 2022; originally announced March 2022.

MSC Class: 68T05; 68Q32

arXiv:2109.02596 [pdf, other]

doi 10.3390/e23101368

Scikit-dimension: a Python package for intrinsic dimension estimation

Authors: Jonathan Bac, Evgeny M. Mirkes, Alexander N. Gorban, Ivan Tyukin, Andrei Zinovyev

Abstract: Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces \texttt{scikit-dimension}, an open-source Python package for intrinsic dimension estimation. \texttt{scikit-dimension} package provides a uniform implementation of most of the known ID estimators based on scikit-learn application programming interface to evaluate global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation in real-life and synthetic data. The source code is available from https://github.com/j-bac/scikit-dimension , the documentation is available from https://scikit-dimension.readthedocs.io .

Submitted 6 September, 2021; originally announced September 2021.

Comments: 12 pages, 4 figures, 1 table

Journal ref: Entropy, 2021, 23(10), 1368

arXiv:2107.01401 [pdf, other]

doi 10.3390/e23091140

Learning from scarce information: using synthetic data to classify Roman fine ware pottery

Authors: Santos J. Núñez Jareño, Daniël P. van Helden, Evgeny M. Mirkes, Ivan Y. Tyukin, Penelope M. Allison

Abstract: In this article we consider a version of the challenging problem of learning from datasets whose size is too limited to allow generalisation beyond the training set. To address the challenge we propose to use a transfer learning approach whereby the model is first trained on a synthetic dataset replicating features of the original objects. In this study the objects were smartphone photographs of near-complete Roman terra sigillata pottery vessels from the collection of the Museum of London. Taking the replicated features from published profile drawings of pottery forms allowed the integration of expert knowledge into the process through our synthetic data generator. After this first initial training the model was fine-tuned with data from photographs of real vessels. We show, through exhaustive experiments across several popular deep learning architectures, different test priors, and considering the impact of the photograph viewpoint and excessive damage to the vessels, that the proposed hybrid approach enables the creation of classifiers with appropriate generalisation performance. This performance is significantly better than that of classifiers trained exclusively on the original data which shows the promise of the approach to alleviate the fundamental issue of learning from small datasets.

Submitted 3 July, 2021; originally announced July 2021.

MSC Class: 68T07; 68T45

arXiv:2106.15416 [pdf, other]

doi 10.3390/e23081090

High-dimensional separability for one- and few-shot learning

Authors: Alexander N. Gorban, Bogdan Grechuk, Evgeny M. Mirkes, Sergey V. Stasenko, Ivan Y. Tyukin

Abstract: This work is driven by a practical question: corrections of Artificial Intelligence (AI) errors. These corrections should be quick and non-iterative. To solve this problem without modification of a legacy AI system, we propose special `external' devices, correctors. Elementary correctors consist of two parts, a classifier that separates the situations with high risk of error from the situations in which the legacy AI system works well and a new decision for situations with potential errors. Input signals for the correctors can be the inputs of the legacy AI system, its internal signals, and outputs. If the intrinsic dimensionality of data is high enough then the classifiers for correction of small number of errors can be very simple. According to the blessing of dimensionality effects, even simple and robust Fisher's discriminants can be used for one-shot learning of AI correctors. Stochastic separation theorems provide the mathematical basis for this one-shot learning. However, as the number of correctors needed grows, the cluster structure of data becomes important and a new family of stochastic separation theorems is required. We refuse the classical hypothesis of the regularity of the data distribution and assume that the data can have a fine-grained structure with many clusters and peaks in the probability density. New stochastic separation theorems for data with fine-grained structure are formulated and proved. The multi-correctors for granular data are proposed. The advantages of the multi-corrector technology were demonstrated by examples of correcting errors and learning new classes of objects by a deep convolutional neural network on the CIFAR-10 dataset. The key problems of the non-classical high-dimensional data analysis are reviewed together with the basic preprocessing steps including supervised, semi-supervised and domain adaptation Principal Component Analysis.

Submitted 22 October, 2021; v1 submitted 28 June, 2021; originally announced June 2021.

Comments: Corrected and restructured version with some extensions

Journal ref: Entropy. 2021; 23(8):1090

arXiv:2106.08966 [pdf]

doi 10.1038/s41598-021-01317-z

Social stress drives the multi-wave dynamics of COVID-19 outbreaks

Authors: I. A. Kastalskiy, E. V. Pankratova, E. M. Mirkes, V. B. Kazantsev, A. N. Gorban

Abstract: The dynamics of epidemics depend on how people's behavior changes during an outbreak. At the beginning of the epidemic, people do not know about the virus, then, after the outbreak of epidemics and alarm, they begin to comply with the restrictions and the spreading of epidemics may decline. Over time, some people get tired/frustrated by the restrictions and stop following them (exhaustion), especially if the number of new cases drops down. After resting for a while, they can follow the restrictions again. But during this pause the second wave can come and become even stronger than the first one. Studies based on SIR models do not predict the observed quick exit from the first wave of epidemics. Social dynamics should be considered. The appearance of the second wave also depends on social factors. Many generalizations of the SIR model have been developed that take into account the weakening of immunity over time, the evolution of the virus, vaccination and other medical and biological details. However, these more sophisticated models do not explain the apparent differences in outbreak profiles between countries with different intrinsic socio-cultural features. In our work, a system of models of the COVID-19 pandemic is proposed, combining the dynamics of social stress with classical epidemic models. Social stress is described by the tools of sociophysics. The combination of a dynamic SIR-type model with the classical triad of stages of the general adaptation syndrome, alarm-resistance-exhaustion, makes it possible to describe with high accuracy the available statistical data for 13 countries. The sets of kinetic constants corresponding to optimal fit of model to data were found. They characterize the ability of society to mobilize efforts against epidemics and maintain this concentration over time, and can further help in the development of strategies specific to a particular society.

Submitted 19 October, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: Minor corrections, enriched discussion and extended bibliography

Journal ref: Sci Rep 11, 22497 (2021)

arXiv:2008.09303 [pdf]

doi 10.1109/TGRS.2021.3076011

Coloring Panchromatic Nighttime Satellite Images: Comparing the Performance of Several Machine Learning Methods

Authors: N. Rybnikova, B. A. Portnov, E. M. Mirkes, A. Zinovyev, A. Brook, A. N. Gorban

Abstract: Artificial light-at-night (ALAN), emitted from the ground and visible from space, marks human presence on Earth. Since the launch of the Suomi National Polar Partnership satellite with the Visible Infrared Imaging Radiometer Suite Day/Night Band (VIIRS/DNB) onboard, global nighttime images have significantly improved; however, they remained panchromatic. Although multispectral images are also available, they are either commercial or free of charge, but sporadic. In this paper, we use several machine learning techniques, such as linear, kernel, random forest regressions, and elastic map approach, to transform panchromatic VIIRS/DNB into Red Green Blue (RGB) images. To validate the proposed approach, we analyze RGB images for eight urban areas worldwide. We link RGB values, obtained from ISS photographs, to panchromatic ALAN intensities, their pixel-wise differences, and several land-use type proxies. Each dataset is used for model training, while other datasets are used for the model validation. The analysis shows that model-estimated RGB images demonstrate a high degree of correspondence with the original RGB images from the ISS database. Yet, estimates, based on linear, kernel and random forest regressions, provide better correlations, contrast similarity and lower WMSEs levels, while RGB images, generated using elastic map approach, provide higher consistency of predictions.

Submitted 10 April, 2021; v1 submitted 21 August, 2020; originally announced August 2020.

Journal ref: IEEE Transactions on Geoscience and Remote Sensing, 60, Art no. 4702715. 2022

arXiv:2007.03788 [pdf, other]

doi 10.1093/gigascience/giaa128

Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data

Authors: Sergey E. Golovenkin, Jonathan Bac, Alexander Chervov, Evgeny M. Mirkes, Yuliya V. Orlova, Emmanuel Barillot, Alexander N. Gorban, Andrei Zinovyev

Abstract: Large observational clinical datasets become increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete pathology develops through a number of stereotypical routes, characterized by `points of no return' and `final states' (such as lethal or recovery states). Extracting this information directly from the data remains challenging, especially in the case of synchronic (with a short-term follow up) observations. Here we suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values, through modeling the geometrical data structure as a bouquet of bifurcating clinical trajectories. The methodology is based on application of elastic principal graphs which can address simultaneously the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations. The methodology allows positioning a patient on a particular clinical trajectory (pathological scenario) and characterizing the degree of progression along it with a qualitative estimate of the uncertainty of the prognosis. Overall, our pseudo-time quantification-based approach gives a possibility to apply the methods developed for dynamical disease phenotyping and illness trajectory analysis (diachronic data analysis) to synchronic observational data. We developed a tool $ClinTrajan$ for clinical trajectory analysis implemented in Python programming language. We test the methodology in two large publicly available datasets: myocardial infarction complications and readmission of diabetic patients data.

Submitted 5 October, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

ACM Class: I.2.6; J.3; J.2

Journal ref: GigaScience, Volume 9, Issue 11, 2020, giaa128

arXiv:2005.06284 [pdf, other]

Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI

Authors: Alexander N. Gorban, Evgeny M. Mirkes

Abstract: Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.

Submitted 22 January, 2023; v1 submitted 13 May, 2020; originally announced May 2020.

Comments: Significantly modified and extended version, 23 pages, 5 figures

arXiv:2004.14249 [pdf, other]

doi 10.3390/e22030264

Universal Gorban's Entropies: Geometric Case Study

Authors: Evgeny M Mirkes

Abstract: Recently, A.N. Gorban presented a rich family of universal Lyapunov functions for any linear or non-linear reaction network with detailed or complex balance. Two main elements of the construction algorithm are partial equilibria of reactions and convex envelopes of families of functions. These new functions aimed to resolve "the mystery" about the difference between the rich family of Lyapunov functions (f-divergences) for linear kinetics and a limited collection of Lyapunov functions for non-linear networks in thermodynamic conditions. The lack of examples did not allow to evaluate the difference between Gorban's entropies and the classical Boltzmann--Gibbs--Shannon entropy despite the obvious difference in their construction. In this paper, results of A.N. Gorban are briefly reviewed, and these functions are analysed and compared for several mechanisms of chemical reactions. The level sets and dynamics along the kinetic trajectories are analysed. The most pronounced difference between the new and classical thermodynamic Lyapunov functions was found far from the partial equilibria, whereas when some fast elementary reactions became close to equilibrium then this difference decreased and vanished in partial equilibria.

Submitted 29 April, 2020; originally announced April 2020.

Journal ref: Entropy, 22(3), p.264 (2020)

arXiv:2004.14230 [pdf, other]

doi 10.3390/e22101105

Fractional norms and quasinorms do not help to overcome the curse of dimensionality

Authors: Evgeny M. Mirkes, Jeza Allohibi, Alexander N. Gorban

Abstract: The curse of dimensionality causes well-known and widely discussed problems for machine learning methods. There is a hypothesis that using the Manhattan distance and even fractional quasinorms lp (for p less than 1) can help to overcome the curse of dimensionality in classification problems. In this study, we systematically test this hypothesis. We confirm that fractional quasinorms have a greater relative contrast or coefficient of variation than the Euclidean norm l2, but we also demonstrate that the distance concentration shows qualitatively the same behaviour for all tested norms and quasinorms and the difference between them decays as dimension tends to infinity. Estimation of classification quality for kNN based on different norms and quasinorms shows that a greater relative contrast does not mean better classifier performance and the worst performance for different databases was shown by different norms (quasinorms). A systematic comparison shows that the difference in the performance of kNN based on lp for p=2, 1, and 0.5 is statistically insignificant.
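The relative contrast mentioned in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experimental protocol: it assumes points drawn uniformly from the unit cube, takes the origin as the query point, and uses the common definition (max distance - min distance) / min distance. The function names `lp_quasinorm` and `relative_contrast` are hypothetical.

```python
import numpy as np

def lp_quasinorm(x, p):
    """l_p norm (p >= 1) or quasinorm (0 < p < 1) along the last axis."""
    return np.sum(np.abs(x) ** p, axis=-1) ** (1.0 / p)

def relative_contrast(points, p):
    """(max - min) / min of the l_p distances from the origin to each point.

    Assumption: the origin is used as the query point.
    """
    d = lp_quasinorm(points, p)
    return (d.max() - d.min()) / d.min()

if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed for repeatability
    for dim in (10, 100, 1000):
        pts = rng.uniform(size=(500, dim))  # assumed data distribution
        contrasts = {p: relative_contrast(pts, p) for p in (0.5, 1.0, 2.0)}
        print(dim, {p: round(float(c), 3) for p, c in contrasts.items()})
```

Under these assumptions the contrast shrinks as the dimension grows for every tested p, which is the qualitative concentration behaviour the abstract describes.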

Submitted 29 April, 2020; originally announced April 2020.

Journal ref: Entropy. 2020; 22(10):1105

arXiv:2004.13717 [pdf, other]

Informational Space of Meaning for Scientific Texts

Authors: Neslihan Suzen, Evgeny M. Mirkes, Alexander N. Gorban

Abstract: In Natural Language Processing, automatically extracting the meaning of texts constitutes an important problem. Our focus is the computational analysis of meaning of short scientific texts (abstracts or brief reports). In this paper, a vector space model is developed for quantifying the meaning of words and texts. We introduce the Meaning Space, in which the meaning of a word is represented by a vector of Relative Information Gain (RIG) about the subject categories that the text belongs to, which can be obtained from observing the word in the text. This new approach is applied to construct the Meaning Space based on Leicester Scientific Corpus (LSC) and Leicester Scientific Dictionary-Core (LScDC). The LSC is a scientific corpus of 1,673,350 abstracts and the LScDC is a scientific dictionary whose words are extracted from the LSC. Each text in the LSC belongs to at least one of 252 subject categories of Web of Science (WoS). These categories are used in construction of vectors of information gains. The Meaning Space is described and statistically analysed for the LSC with the LScDC. The usefulness of the proposed representation model is evaluated through top-ranked words in each category. The most informative n words are ordered. We demonstrated that RIG-based word ranking is much more useful than ranking based on raw word frequency in determining the science-specific meaning and importance of a word. The proposed model based on RIG is shown to be able to highlight topic-specific words in categories. The most informative words are presented for 252 categories. The new scientific dictionary and the 103,998 x 252 Word-Category RIG Matrix are available online. Analysis of the Meaning Space provides us with a tool to further explore quantifying the meaning of a text using more complex and context-dependent meaning models that use co-occurrence of words and their combinations.
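To make the Relative Information Gain of a word about one category concrete, here is a minimal sketch. It assumes the standard definition RIG = (H(C) - H(C|W)) / H(C), where C is binary category membership and W is binary word presence; the abstract does not spell out the formula, so this definition, the function names, and the count arguments are assumptions for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def relative_information_gain(n11, n10, n01, n00):
    """RIG of category membership from observing a word (assumed definition).

    n11: texts in the category that contain the word
    n10: texts in the category without the word
    n01: texts outside the category that contain the word
    n00: texts outside the category without the word
    """
    n = n11 + n10 + n01 + n00
    h_category = entropy([(n11 + n10) / n, (n01 + n00) / n])
    if h_category == 0:
        return 0.0  # category is trivial, nothing to gain
    h_conditional = 0.0
    # Average the category entropy over "word present" and "word absent".
    for in_cat, out_cat in ((n11, n01), (n10, n00)):
        m = in_cat + out_cat
        if m:
            h_conditional += (m / n) * entropy([in_cat / m, out_cat / m])
    return (h_category - h_conditional) / h_category
```

Under this definition, a word that appears in every text of a category and nowhere else scores 1, and a word distributed independently of the category scores 0. Evaluating it for one word across all 252 categories would give that word's row in a Word-Category matrix.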

Submitted 28 April, 2020; originally announced April 2020.

Comments: 320 pages

arXiv:2001.06520 [pdf, other]

doi 10.1007/978-3-030-10442-9

Personality Traits and Drug Consumption. A Story Told by Data

Authors: Elaine Fehrman, Vincent Egan, Alexander N. Gorban, Jeremy Levesley, Evgeny M. Mirkes, Awaz K. Muhammad

Abstract: This is a preprint version of the first book from the series: "Stories told by data". In this book a story is told about the psychological traits associated with drug consumption. The book includes: - A review of published works on the psychological profiles of drug users. - Analysis of a new original database with information on 1885 respondents and usage of 18 drugs. (Database is available online.) - An introductory description of the data mining and machine learning methods used for the analysis of this dataset. - The demonstration that the personality traits (five factor model, impulsivity, and sensation seeking), together with simple demographic data, give the possibility of predicting the risk of consumption of individual drugs with sensitivity and specificity above 70% for most drugs. - The analysis of correlations of use of different substances and the description of the groups of drugs with correlated use (correlation pleiades). - Proof of significant differences of personality profiles for users of different drugs. This is explicitly proved for benzodiazepines, ecstasy, and heroin. - Tables of personality profiles for users and non-users of 18 substances. The book is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of machine learning, advanced data mining concepts or modern psychology of personality is assumed. For more detailed introduction into statistical methods we recommend several undergraduate textbooks. Familiarity with basic statistics and some experience in the use of probabilities would be helpful as well as some basic technical understanding of psychology.

Submitted 17 January, 2020; originally announced January 2020.

Comments: A preprint version prepared by the authors before the Springer editorial work. 124 pages, 27 figures, 63 tables, bibl. 244

Journal ref: Springer, Cham, Research Monograph, 2019, ISBN 978-3-030-10441-2

arXiv:1912.06858 [pdf, other]

LScDC-new large scientific dictionary

Authors: Neslihan Suzen, Evgeny M. Mirkes, Alexander N. Gorban

Abstract: In this paper, we present a scientific corpus of abstracts of academic papers in English -- Leicester Scientific Corpus (LSC). The LSC contains 1,673,824 abstracts of research articles and proceeding papers indexed by Web of Science (WoS) whose publication year is 2014. Each abstract is assigned to at least one of 252 subject categories. Paper metadata include these categories and the number of citations. We then develop scientific dictionaries named Leicester Scientific Dictionary (LScD) and Leicester Scientific Dictionary-Core (LScDC), where words are extracted from the LSC. The LScD is a list of 974,238 unique words (lemmas). The LScDC is a core list (sub-list) of the LScD with 104,223 lemmas. It was created by removing LScD words appearing in no more than 10 texts in the LSC. LScD and LScDC are available online. Both the corpus and dictionaries are developed to be later used for quantification of meaning in academic texts. Finally, the core list LScDC was analysed by comparing its words and word frequencies with a classic academic word list 'New Academic Word List (NAWL)' containing 963 word families, which is also sampled from an academic corpus. The major sources of the corpus from which NAWL is extracted are Cambridge English Corpus (CEC), oral sources and textbooks. We investigate whether the two dictionaries are similar in terms of common words and ranking of words. Our comparison leads us to the main conclusion: most words of NAWL (99.6%) are present in the LScDC but the two lists differ in word ranking. This difference is measured.

Submitted 14 December, 2019; originally announced December 2019.

Comments: 63 pages

arXiv:1811.05321 [pdf, other]

doi 10.1016/j.ins.2018.07.040

Correction of AI systems by linear discriminants: Probabilistic foundations

Authors: A. N. Gorban, A. Golubkov, B. Grechuk, E. M. Mirkes, I. Y. Tyukin

Abstract: Artificial Intelligence (AI) systems sometimes make errors and will make errors in the future, from time to time. These errors are usually unexpected, and can lead to dramatic consequences. Intensive development of AI and its practical applications makes the problem of errors more important. Total re-engineering of the systems can create new errors and is not always possible due to the resources involved. The important challenge is to develop fast methods to correct errors without damaging existing skills. We formulated the technical requirements to the 'ideal' correctors. Such correctors include binary classifiers, which separate the situations with high risk of errors from the situations where the AI systems work properly. Surprisingly, for essentially high-dimensional data such methods are possible: simple linear Fisher discriminant can separate the situations with errors from correctly solved tasks even for exponentially large samples. The paper presents the probabilistic basis for fast non-destructive correction of AI systems. A series of new stochastic separation theorems is proven. These theorems provide new instruments for fast non-iterative correction of errors of legacy AI systems. The new approaches become efficient in high-dimensions, for correction of high-dimensional systems in high-dimensional world (i.e. for processing of essentially high-dimensional data by large systems).

Submitted 11 November, 2018; originally announced November 2018.

Comments: arXiv admin note: text overlap with arXiv:1809.07656 and arXiv:1802.02172

Journal ref: Information Sciences 466 (2018), 303-322

arXiv:1805.01516 [pdf, ps, other]

How deep should be the depth of convolutional neural networks: a backyard dog case study

Authors: A. N. Gorban, E. M. Mirkes, I. Y. Tyukin

Abstract: The work concerns the problem of reducing a pre-trained deep neuronal network to a smaller network, with just few layers, whilst retaining the network's functionality on a given task. The proposed approach is motivated by the observation that the aim to deliver the highest accuracy possible in the broadest range of operational conditions, which many deep neural networks models strive to achieve, may not necessarily be always needed, desired, or even achievable due to the lack of data or technical constraints. In relation to the face recognition problem, we formulated an example of such a usecase, the `backyard dog' problem. The `backyard dog', implemented by a lean network, should correctly identify members from a limited group of individuals, a `family', and should distinguish between them. At the same time, the network must produce an alarm to an image of an individual who is not a member of the family. To produce such a network, we propose a shallowing algorithm. The algorithm takes an existing deep learning model on its input and outputs a shallowed version of it. The algorithm is non-iterative and is based on the Advanced Supervised Principal Component Analysis. Performance of the algorithm is assessed in exhaustive numerical experiments. In the above usecase, the `backyard dog' problem, the method is capable of drastically reducing the depth of deep learning neural networks, albeit at the cost of mild performance deterioration. We developed a simple non-iterative method for shallowing down pre-trained deep networks. The method is generic in the sense that it applies to a broad class of feed-forward networks, and is based on the Advanced Supervised Principal Component Analysis. The method enables generation of families of smaller-size shallower specialized networks tuned for specific operational conditions and tasks from a single larger and more universal legacy network.

Submitted 8 December, 2019; v1 submitted 3 May, 2018; originally announced May 2018.

Comments: Edited and extended version with more detailed description of numerical experiments

arXiv:1804.07580 [pdf]

doi 10.3390/e22030296

Robust And Scalable Learning Of Complex Dataset Topologies Via Elpigraph

Authors: Luca Albergante, Evgeny M. Mirkes, Huidong Chen, Alexis Martin, Louis Faure, Emmanuel Barillot, Luca Pinello, Alexander N. Gorban, Andrei Zinovyev

Abstract: Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, with the recent single-cell transcriptomic studies of developing embryo being notable examples. Reducing the complexity and producing compact and interpretable representations of such data remains a challenging task. Most of the existing computational methods are based on exploring the local data point neighbourhood relations, a step that can perform poorly in the case of multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. This method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.

Submitted 20 June, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

Comments: 32 pages, 14 figures

Journal ref: Entropy 22, no. 3: 296, 2020

arXiv:1702.02633 [pdf, other]

doi 10.1007/s11004-017-9701-2

Pseudo-Outcrop Visualization of Borehole Images and Core Scans

Authors: Evgeny M. Mirkes, Alexander N. Gorban, Jeremy Levesley, Peter A. S. Elkington, James A. Whetton

Abstract: A pseudo-outcrop visualization is demonstrated for borehole and full-diameter rock core images to augment the ubiquitous unwrapped cylinder view and thereby to assist non-specialist interpreters. The pseudo-outcrop visualization is equivalent to a nonlinear projection of the image from borehole to earth frame of reference that creates a solid volume sliced longitudinally to reveal two or more faces in which the orientations of geological features indicate what is observed in the subsurface. A proxy for grain size is used to modulate the external dimensions of the plot to mimic profiles seen in real outcrops. The volume is created from a mixture of geological boundary elements and texture, the latter being the residue after the sum of boundary elements is subtracted from the original data. In the case of measurements from wireline microresistivity tools, whose circumferential coverage is substantially less than 100%, the missing circumferential data is first inpainted using multiscale directional transforms, which decompose the image into its elemental building structures, before reconstructing the full image. The pseudo-outcrop view enables direct observation of the angular relationships between features and aids visual comparison between borehole and core images, especially for the interested non-specialist.

Submitted 3 September, 2017; v1 submitted 8 February, 2017; originally announced February 2017.

Comments: Updated and corrected version with extended set of figures

Journal ref: Mathematical Geosciences, 2017

arXiv:1605.06276 [pdf, ps, other]

doi 10.1016/j.neunet.2016.08.007

Piece-wise quadratic approximations of arbitrary error functions for fast and robust machine learning

Authors: A. N. Gorban, E. M. Mirkes, A. Zinovyev

Abstract: Most machine learning approaches have stemmed from the application of the principle of minimizing the mean squared distance, based on computationally efficient quadratic optimization methods. However, when faced with high-dimensional and noisy data, the quadratic error functionals demonstrated many weaknesses including high sensitivity to contaminating factors and dimensionality curse. Therefore, a lot of recent applications in machine learning exploited properties of non-quadratic error functionals based on $L_1$ norm or even sub-linear potentials corresponding to quasinorms $L_p$ ($0<p<1$). The drawback of these approaches is an increase in computational cost for optimization. So far, no approaches have been suggested to deal with arbitrary error functionals in a flexible and computationally efficient framework. In this paper, we develop a theory and basic universal data approximation algorithms ($k$-means, principal components, principal manifolds and graphs, regularized and sparse regression), based on piece-wise quadratic error potentials of subquadratic growth (PQSQ potentials). We develop a new and universal framework to minimize arbitrary sub-quadratic error potentials using an algorithm with guaranteed fast convergence to the local or global error minimum. The theory of PQSQ potentials is based on the notion of the cone of minorant functions, and represents a natural approximation formalism based on the application of min-plus algebra. The approach can be applied in most existing machine learning methods, including methods of data approximation and regularized and sparse regression, leading to an improvement in the computational cost/accuracy trade-off. We demonstrate that on synthetic and real-life datasets PQSQ-based machine learning methods achieve orders of magnitude faster computational performance than the corresponding state-of-the-art methods.
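The idea of a piece-wise quadratic potential of subquadratic growth can be sketched as follows. The abstract does not give the construction, so this is an assumption-laden illustration rather than the authors' algorithm: it assumes a target potential `f`, a set of increasing thresholds starting at 0, quadratic pieces `b + a * x**2` that match `f` at the two ends of each interval, and a constant value beyond the last threshold. The name `pqsq_potential` is hypothetical.

```python
import numpy as np

def pqsq_potential(x, thresholds, f):
    """Piece-wise quadratic approximation of a potential f (illustrative).

    thresholds: increasing values 0 = r_0 < r_1 < ... < r_m (assumed form).
    On [r_k, r_{k+1}) the result is b_k + a_k * x**2, chosen so that it
    equals f at r_k and r_{k+1}. Beyond r_m the value is the constant f(r_m).
    """
    r = np.asarray(thresholds, dtype=float)
    ax = np.abs(np.atleast_1d(np.asarray(x, dtype=float)))
    out = np.full(ax.shape, float(f(r[-1])))  # flat tail past last threshold
    for k in range(len(r) - 1):
        lo, hi = r[k], r[k + 1]
        a = (f(hi) - f(lo)) / (hi ** 2 - lo ** 2)  # curvature of this piece
        b = f(lo) - a * lo ** 2                    # offset matching f(lo)
        mask = (ax >= lo) & (ax < hi)
        out[mask] = b + a * ax[mask] ** 2
    return out

if __name__ == "__main__":
    # Approximate the L1 potential |x| with three thresholds.
    print(pqsq_potential([0.0, 0.5, 1.0, 1.5, 5.0], [0.0, 1.0, 2.0], np.abs))
```

Because each piece is quadratic in x, a minimization step on a fixed assignment of residuals to intervals reduces to an ordinary least-squares problem, which is one plausible reading of why such potentials keep optimization cheap.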

Submitted 21 August, 2016; v1 submitted 20 May, 2016; originally announced May 2016.

Comments: Edited and extended version with algorithms of regularized regression

Journal ref: Neural Networks, Volume 84, December 2016, 28-38

arXiv:1604.00627 [pdf, ps, other]

doi 10.1016/j.compbiomed.2016.06.004

Handling missing data in large healthcare dataset: a case study of unknown trauma outcomes

Authors: E. M. Mirkes, T. J. Coats, J. Levesley, A. N. Gorban

Abstract: Handling of missing data is one of the main tasks in data preprocessing, especially in large public service datasets. We have analysed data from the Trauma Audit and Research Network (TARN) database, the largest trauma database in Europe. For the analysis we used 165,559 trauma cases. Among them, there are 19,289 cases (13.19%) with unknown outcome. We have demonstrated that these outcomes are not missed `completely at random' and, hence, it is impossible just to exclude these cases from analysis despite the large amount of available data. We have developed a system of non-stationary Markov models for the handling of missed outcomes and validated these models on the data of 15,437 patients who arrived at TARN hospitals later than 24 hours but within 30 days from injury. We used these Markov models for the analysis of mortality. In particular, we corrected the observed fraction of death. Two naïve approaches give 7.20% (available case study) or 6.36% (if we assume that all unknown outcomes are `alive'). The corrected value is 6.78%. Following the seminal paper of Trunkey (1983) the multimodality of mortality curves has become a much discussed idea. For the whole analysed TARN dataset the coefficient of mortality monotonically decreases in time but the stratified analysis of the mortality gives a different result: for lower severities the coefficient of mortality is a non-monotonic function of the time after injury and may have maxima at the second and third weeks. The approach developed here can be applied to various healthcare datasets which experience the problem of lost patients and missed outcomes.

Submitted 18 May, 2020; v1 submitted 3 April, 2016; originally announced April 2016.

Comments: Minor editing and additions

Journal ref: Computers in Biology and Medicine, 75 (2016) 203-216

arXiv:1603.06828 [pdf, other]

doi 10.5445/KSP/1000058749/11

Robust principal graphs for data approximation

Authors: A. N. Gorban, E. M. Mirkes, A. Zinovyev

Abstract: Revealing hidden geometry and topology in noisy data sets is a challenging task. Elastic principal graph is a computationally efficient and flexible data approximator based on embedding a graph into the data space and minimizing the energy functional penalizing the deviation of graph nodes both from data points and from pluri-harmonic configuration (generalization of linearity). The structure of principal graph is learned from data by application of a topological grammar which in the simplest case leads to the construction of principal curves or trees. In order to more efficiently cope with noise and outliers, here we suggest using a trimmed data approximation term to increase the robustness of the method. The modification of the method that we suggest does not affect either computational efficiency or general convergence properties of the original elastic graph method. The trimmed elastic energy functional remains a Lyapunov function for the optimization algorithm. On several examples of complex data distributions we demonstrate how the robust principal graphs learn the global data structure and show the advantage of using the trimmed data approximation term for the construction of principal graphs and other popular data approximators.

Submitted 24 November, 2016; v1 submitted 22 March, 2016; originally announced March 2016.

Comments: A talk given at ECDA2015 (European Conference on Data Analysis, September 2nd to 4th 2015, University of Essex, Colchester, UK), to be published in Archives of Data Science

Journal ref: Archives of Data Science, Series A, Vol. 2, No. 1, 2017

arXiv:1506.06297 [pdf, ps, other]

The Five Factor Model of personality and evaluation of drug consumption risk

Authors: E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan, A. N. Gorban

Abstract: The problem of evaluating an individual's risk of drug consumption and misuse is highly important. An online survey methodology was employed to collect data including Big Five personality traits (NEO-FFI-R), impulsivity (BIS-11), sensation seeking (ImpSS), and demographic information. The data set contained information on the consumption of 18 central nervous system psychoactive drugs. Correlation analysis demonstrated the existence of groups of drugs with strongly correlated consumption patterns. Three correlation pleiades were identified, named by the central drug in the pleiad: ecstasy, heroin, and benzodiazepines pleiades. An exhaustive search was performed to select the most effective subset of input features and data mining methods to classify users and non-users for each drug and pleiad. A number of classification methods were employed (decision tree, random forest, $k$-nearest neighbors, linear discriminant analysis, Gaussian mixture, probability density function estimation, logistic regression and naïve Bayes) and the most effective classifier was selected for each drug. The quality of classification was surprisingly high with sensitivity and specificity (evaluated by leave-one-out cross-validation) being greater than 70% for almost all classification tasks. The best results with sensitivity and specificity being greater than 75% were achieved for cannabis, crack, ecstasy, legal highs, LSD, and volatile substance abuse (VSA).

Submitted 15 January, 2017; v1 submitted 20 June, 2015; originally announced June 2015.

Comments: Significantly extended report with 67 pages, 27 tables, 21 figures

arXiv:1503.05869 [pdf]

Long and short range multi-locus QTL interactions in a complex trait of yeast

Authors: Evgeny M. Mirkes, Thomas Walsh, Edward J. Louis, Alexander N. Gorban

Abstract: We analyse interactions of Quantitative Trait Loci (QTL) in heat selected yeast by comparing them to an unselected pool of random individuals. Here we re-examine data on individual F12 progeny selected for heat tolerance, which have been genotyped at 25 locations identified by sequencing a selected pool [Parts, L., Cubillos, F. A., Warringer, J., Jain, K., Salinas, F., Bumpstead, S. J., Molin, M., Zia, A., Simpson, J. T., Quail, M. A., Moses, A., Louis, E. J., Durbin, R., and Liti, G. (2011). Genome research, 21(7), 1131-1138]. 960 individuals were genotyped at these locations and multi-locus genotype frequencies were compared to 172 sequenced individuals from the original unselected pool (a control group). Various non-random associations were found across the genome, both within chromosomes and between chromosomes. Some of the non-random associations are likely due to retention of linkage disequilibrium in the F12 population; however, many, including the inter-chromosomal interactions, must be due to genetic interactions in heat tolerance. One region of particular interest involves 3 linked loci on chromosome IV where the central variant responsible for heat tolerance is antagonistic, coming from the heat sensitive parent, and the flanking ones are from the more heat tolerant parent. The 3-locus haplotypes in the selected individuals represent a highly biased sample of the population haplotypes with rare double recombinants in high frequency. These were missed in the original analysis and would never be seen without the multigenerational approach. We show that a statistical analysis of entropy and information gain in genotypes of a selected population can reveal further interactions than previously seen. Importantly, this must be done in comparison to the unselected population's genotypes to account for inherent biases in the original population.

Submitted 19 March, 2015; originally announced March 2015.

arXiv:1305.4942 [pdf, ps, other]

doi 10.1016/j.compbiomed.2014.08.006

Computational diagnosis and risk evaluation for canine lymphoma

Authors: E. M. Mirkes, I. Alexandrakis, K. Slater, R. Tuli, A. N. Gorban

Abstract: The canine lymphoma blood test detects the levels of two biomarkers, the acute phase proteins (C-Reactive Protein and Haptoglobin). This test can be used for diagnostics, for screening, and for remission monitoring as well. We analyze clinical data, test various machine learning methods and select the best approach to these problems. Three families of methods, decision trees, kNN (including advanced and adaptive kNN) and probability density evaluation with radial basis functions, are used for classification and risk estimation. Several pre-processing approaches were implemented and compared. The best of them are used to create the diagnostic system. For the differential diagnosis the best solution gives the sensitivity and specificity of 83.5% and 77%, respectively (using three input features, CRP, Haptoglobin and standard clinical symptom). For the screening task, the decision tree method provides the best result, with sensitivity and specificity of 81.4% and >99%, respectively (using the same input features). If the clinical symptoms (Lymphadenopathy) are considered as unknown then a decision tree with CRP and Hapt only provides sensitivity 69% and specificity 83.5%. The lymphoma risk evaluation problem is formulated and solved. The best models are selected as the system for computational lymphoma diagnosis and evaluation of the risk of lymphoma as well. These methods are implemented into a special web-accessed software and are applied to the problem of monitoring dogs with lymphoma after treatment. It detects recurrence of lymphoma up to two months prior to the appearance of clinical signs. The risk map visualisation provides a friendly tool for explanatory data analysis.

Submitted 3 July, 2014; v1 submitted 21 May, 2013; originally announced May 2013.

Comments: 24 pages, 86 references in the bibliography. Significantly extended version with review of lymphoma biomarkers and data mining methods (three new sections are added: 1.1. Biomarkers for canine lymphoma, 1.2. Acute phase proteins as lymphoma biomarkers, and 3.1. Data mining methods for biomarker cancer diagnosis). Flowcharts of data analysis are included as supplementary material (20 pages)

Journal ref: Computers in Biology and Medicine, Volume 53, 1 October 2014, 279-290

arXiv:1302.2645 [pdf]

doi 10.1007/978-3-642-38679-4_50

Geometrical complexity of data approximators

Authors: E. M. Mirkes, A. Zinovyev, A. N. Gorban

Abstract: There are many methods developed to approximate a cloud of vectors embedded in high-dimensional space by simpler objects: starting from principal points and linear manifolds to self-organizing maps, neural gas, elastic maps, various types of principal curves and principal trees, and so on. For each type of approximator, a measure of the approximator complexity was developed too. These measures are necessary to find the balance between accuracy and complexity and to define the optimal approximations of a given type. We propose a measure of complexity (geometrical complexity) which is applicable to approximators of several types and which allows comparing data approximations of different types.

Submitted 3 May, 2013; v1 submitted 11 February, 2013; originally announced February 2013.

Comments: 10 pages, 3 figures, minor correction and extension

Journal ref: IWANN 2013, Advances in Computational Intelligence, Springer LNCS 7902, pp. 500-509, 2013

arXiv:1210.5873 [pdf]

Initialization of Self-Organizing Maps: Principal Components Versus Random Initialization. A Case Study

Authors: A. A. Akinduko, E. M. Mirkes

Abstract: The performance of the Self-Organizing Map (SOM) algorithm is dependent on the initial weights of the map. The different initialization methods can broadly be classified into random and data-analysis-based initialization approaches. In this paper, the performance of the random initialization (RI) approach is compared to that of principal component initialization (PCI), in which the initial map weights are chosen from the space of the principal component. Performance is evaluated by the fraction of variance unexplained (FVU). Datasets were classified into quasi-linear and non-linear, and it was observed that RI performed better for non-linear datasets; however, the performance of the PCI approach remains inconclusive for quasi-linear datasets.

Submitted 22 October, 2012; originally announced October 2012.

Comments: 18 pages, 6 figures

arXiv:1207.2507 [pdf, ps, other]

doi 10.1016/j.physa.2012.10.009

Thermodynamics in the Limit of Irreversible Reactions

Authors: A. N. Gorban, E. M. Mirkes, G. S. Yablonsky

Abstract: For many real physico-chemical complex systems, the detailed mechanism includes both reversible and irreversible reactions. Such systems are typical in homogeneous combustion and heterogeneous catalytic oxidation. Most complex enzyme reactions include irreversible steps. Classical thermodynamics has no limit for irreversible reactions, whereas the kinetic equations may have such a limit. We represent the systems with irreversible reactions as the limits of the fully reversible systems when some of the equilibrium concentrations tend to zero. The structure of the limit reaction system crucially depends on the relative rates of this tendency to zero. We study the dynamics of the limit system and describe its limit behavior as $t \to \infty$. If the reversible systems obey the principle of detailed balance then the limit system with some irreversible reactions must satisfy the extended principle of detailed balance. It is formulated and proven in the form of two conditions: (i) the reversible part satisfies the principle of detailed balance and (ii) the convex hull of the stoichiometric vectors of the irreversible reactions does not intersect the linear span of the stoichiometric vectors of the reversible reactions. These conditions imply the existence of global Lyapunov functionals and allow an algebraic description of the limit behavior. The thermodynamic theory of the irreversible limit of reversible reactions is illustrated by the analysis of hydrogen combustion.

Submitted 11 October, 2012; v1 submitted 10 July, 2012; originally announced July 2012.

Comments: 23 pages, extended version with figs

Journal ref: Physica A, Volume 392, Issue 6, 2013, Pages 1318-1335

arXiv:cond-mat/0307083 [pdf]

Generation of Explicit Knowledge from Empirical Data through Pruning of Trainable Neural Networks

Authors: A. N. Gorban, Eu. M. Mirkes, V. G. Tsaregorodtsev

Abstract: This paper presents a generalized technology of extraction of explicit knowledge from data. The main ideas are 1) maximal reduction of network complexity (not only removal of neurons or synapses, but removal of all the unnecessary elements and signals and reduction of the complexity of elements), 2) use of an adjustable and flexible pruning process (the pruning sequence shouldn't be predetermined: the user should be able to prune the network in their own way in order to achieve a desired network structure for the purpose of extracting rules of the desired type and form), and 3) extraction of rules not in a predetermined form but in any desired form. Some considerations and notes about network architecture and the training process, and the applicability of currently developed pruning techniques and rule extraction algorithms, are discussed. This technology, being developed by us for more than 10 years, allowed us to create dozens of knowledge-based expert systems. In this paper we present a generalized three-step technology of extraction of explicit knowledge from empirical data.

Submitted 3 July, 2003; originally announced July 2003.

Comments: 9 pages, The talk was given at the IJCNN '99 (Washington DC, July 1999)

Showing 1–32 of 32 results for author: Mirkes, E M