-
Practical machine learning is learning on small samples
Authors:
Marina Sapir
Abstract:
Based on limited observations, machine learning discerns a dependence which is expected to hold in the future. What makes it possible? Statistical learning theory imagines indefinitely increasing training sample to justify its approach. In reality, there is no infinite time or even infinite general population for learning. Here I argue that practical machine learning is based on an implicit assump…
▽ More
Based on limited observations, machine learning discerns a dependence which is expected to hold in the future. What makes it possible? Statistical learning theory imagines indefinitely increasing training sample to justify its approach. In reality, there is no infinite time or even infinite general population for learning. Here I argue that practical machine learning is based on an implicit assumption that underlying dependence is relatively ``smooth" : likely, there are no abrupt differences in feedback between cases with close data points. From this point of view learning shall involve selection of the hypothesis ``smoothly" approximating the training set. I formalize this as Practical learning paradigm. The paradigm includes terminology and rules for description of learners. Popular learners (local smoothing, k-NN, decision trees, Naive Bayes, SVM for classification and for regression) are shown here to be implementations of this paradigm.
△ Less
Submitted 3 January, 2025;
originally announced January 2025.
-
Theory of Machine Learning with Limited Data
Authors:
Marina Sapir
Abstract:
Application of machine learning may be understood as deriving new knowledge for practical use through explaining accumulated observations, training set. Peirce used the term abduction for this kind of inference. Here I formalize the concept of abduction for real valued hypotheses, and show that 14 of the most popular textbook ML learners (every learner I tested), covering classification, regressio…
▽ More
Application of machine learning may be understood as deriving new knowledge for practical use through explaining accumulated observations, training set. Peirce used the term abduction for this kind of inference. Here I formalize the concept of abduction for real valued hypotheses, and show that 14 of the most popular textbook ML learners (every learner I tested), covering classification, regression and clustering, implement this concept of abduction inference. The approach is proposed as an alternative to statistical learning theory, which requires an impractical assumption of indefinitely increasing training set for its justification.
△ Less
Submitted 1 January, 2023; v1 submitted 15 June, 2022;
originally announced June 2022.
-
Logic of Machine Learning
Authors:
Marina Sapir
Abstract:
The main question is: why and how can we ever predict based on a finite sample? The question is not answered by statistical learning theory. Here, I suggest that prediction requires belief in "predictability" of the underlying dependence, and learning involves search for a hypothesis where these beliefs are violated the least given the observations. The measure of these violations ("errors") for g…
▽ More
The main question is: why and how can we ever predict based on a finite sample? The question is not answered by statistical learning theory. Here, I suggest that prediction requires belief in "predictability" of the underlying dependence, and learning involves search for a hypothesis where these beliefs are violated the least given the observations. The measure of these violations ("errors") for given data, hypothesis and particular type of predictability beliefs is formalized as concept of incongruity in modal Logic of Observations and Hypotheses (LOH). I show on examples of many popular textbook learners (from hierarchical clustering to k-NN and SVM) that each of them minimizes its own version of incongruity. In addition, the concept of incongruity is shown to be flexible enough for formalization of some important data analysis problems, not considered as part of ML.
△ Less
Submitted 27 January, 2022; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Learnable: Theory vs Applications
Authors:
Marina Sapir
Abstract:
Two different views on machine learning problem: Applied learning (machine learning with business applications) and Agnostic PAC learning are formalized and compared here. I show that, under some conditions, the theory of PAC Learnable provides a way to solve the Applied learning problem. However, the theory requires to have the training sets so large, that it would make the learning practically u…
▽ More
Two different views on machine learning problem: Applied learning (machine learning with business applications) and Agnostic PAC learning are formalized and compared here. I show that, under some conditions, the theory of PAC Learnable provides a way to solve the Applied learning problem. However, the theory requires to have the training sets so large, that it would make the learning practically useless. I suggest shedding some theoretical misconceptions about learning to make the theory more aligned with the needs and experience of practitioners.
△ Less
Submitted 27 July, 2018;
originally announced July 2018.
-
Optimal choice: new machine learning problem and its solution
Authors:
Marina Sapir
Abstract:
The task of learning to pick a single preferred example out a finite set of examples, an "optimal choice problem", is a supervised machine learning problem with complex, structured input. Problems of optimal choice emerge often in various practical applications. We formalize the problem, show that it does not satisfy the assumptions of statistical learning theory, yet it can be solved efficiently…
▽ More
The task of learning to pick a single preferred example out a finite set of examples, an "optimal choice problem", is a supervised machine learning problem with complex, structured input. Problems of optimal choice emerge often in various practical applications. We formalize the problem, show that it does not satisfy the assumptions of statistical learning theory, yet it can be solved efficiently in some cases. We propose two approaches to solve the problem. Both of them reach good solutions on real life data from a signal processing application.
△ Less
Submitted 6 July, 2017; v1 submitted 26 June, 2017;
originally announced June 2017.
-
Bipartite ranking algorithm for classification and survival analysis
Authors:
Marina Sapir
Abstract:
Unsupervised aggregation of independently built univariate predictors is explored as an alternative regularization approach for noisy, sparse datasets. Bipartite ranking algorithm Smooth Rank implementing this approach is introduced. The advantages of this algorithm are demonstrated on two types of problems. First, Smooth Rank is applied to two-class problems from bio-medical field, where ranking…
▽ More
Unsupervised aggregation of independently built univariate predictors is explored as an alternative regularization approach for noisy, sparse datasets. Bipartite ranking algorithm Smooth Rank implementing this approach is introduced. The advantages of this algorithm are demonstrated on two types of problems. First, Smooth Rank is applied to two-class problems from bio-medical field, where ranking is often preferable to classification. In comparison against SVMs with radial and linear kernels, Smooth Rank had the best performance on 8 out of 12 benchmark benchmarks. The second area of application is survival analysis, which is reduced here to bipartite ranking in a way which allows one to use commonly accepted measures of methods performance. In comparison of Smooth Rank with Cox PH regression and CoxPath methods, Smooth Rank proved to be the best on 9 out of 10 benchmark datasets.
△ Less
Submitted 8 December, 2011;
originally announced December 2011.
-
Bias Plus Variance Decomposition for Survival Analysis Problems
Authors:
Marina Sapir
Abstract:
Bias - variance decomposition of the expected error defined for regression and classification problems is an important tool to study and compare different algorithms, to find the best areas for their application. Here the decomposition is introduced for the survival analysis problem. In our experiments, we study bias -variance parts of the expected error for two algorithms: original Cox proportion…
▽ More
Bias - variance decomposition of the expected error defined for regression and classification problems is an important tool to study and compare different algorithms, to find the best areas for their application. Here the decomposition is introduced for the survival analysis problem. In our experiments, we study bias -variance parts of the expected error for two algorithms: original Cox proportional hazard regression and CoxPath, path algorithm for L1-regularized Cox regression, on the series of increased training sets. The experiments demonstrate that, contrary expectations, CoxPath does not necessarily have an advantage over Cox regression.
△ Less
Submitted 24 September, 2011;
originally announced September 2011.
-
Ensemble Risk Modeling Method for Robust Learning on Scarce Data
Authors:
Marina Sapir
Abstract:
In medical risk modeling, typical data are "scarce": they have relatively small number of training instances (N), censoring, and high dimensionality (M). We show that the problem may be effectively simplified by reducing it to bipartite ranking, and introduce new bipartite ranking algorithm, Smooth Rank, for robust learning on scarce data. The algorithm is based on ensemble learning with unsupervi…
▽ More
In medical risk modeling, typical data are "scarce": they have relatively small number of training instances (N), censoring, and high dimensionality (M). We show that the problem may be effectively simplified by reducing it to bipartite ranking, and introduce new bipartite ranking algorithm, Smooth Rank, for robust learning on scarce data. The algorithm is based on ensemble learning with unsupervised aggregation of predictors. The advantage of our approach is confirmed in comparison with two "gold standard" risk modeling methods on 10 real life survival analysis datasets, where the new approach has the best results on all but two datasets with the largest ratio N/M. For systematic study of the effects of data scarcity on modeling by all three methods, we conducted two types of computational experiments: on real life data with randomly drawn training sets of different sizes, and on artificial data with increasing number of features. Both experiments demonstrated that Smooth Rank has critical advantage over the popular methods on the scarce data; it does not suffer from overfitting where other methods do.
△ Less
Submitted 28 January, 2012; v1 submitted 13 August, 2011;
originally announced August 2011.