-
Quantifying Behavioural Distance Between Mathematical Expressions
Authors:
Sebastian Mežnar,
Sašo Džeroski,
Ljupčo Todorovski
Abstract:
Existing symbolic regression methods organize the space of candidate mathematical expressions primarily based on their syntactic, structural similarity. However, this approach overlooks crucial equivalences between expressions that arise from mathematical symmetries, such as commutativity, associativity, and distribution laws for arithmetic operations. Consequently, expressions with similar errors on a given data set are far apart from each other in the search space. This leads to a rough error landscape in the search space that efficient local, gradient-based methods cannot explore. This paper proposes and implements a measure of behavioral distance, BED, that clusters together expressions with similar errors. The experimental results show that the stochastic method for calculating BED achieves consistency with a modest number of sampled values for evaluating the expressions. This leads to computational efficiency comparable to the tree-based syntactic distance. Our findings also reveal that BED significantly improves the smoothness of the error landscape in the search space for symbolic regression.
Submitted 21 August, 2024;
originally announced August 2024.
-
MLFMF: Data Sets for Machine Learning for Mathematical Formalization
Authors:
Andrej Bauer,
Matej Petković,
Ljupčo Todorovski
Abstract:
We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean 4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the s-expressions give complete and easily parsed information about every entry. We report baseline results using standard graph and word embeddings, tree ensembles, and instance-based learning algorithms. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. The methodology used to extract the networks and the s-expressions readily applies to other libraries, and is applicable to other proof assistants. With more than 250 000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format.
Submitted 24 October, 2023;
originally announced October 2023.
-
Efficient Generator of Mathematical Expressions for Symbolic Regression
Authors:
Sebastian Mežnar,
Sašo Džeroski,
Ljupčo Todorovski
Abstract:
We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathematical expressions and can accurately encode expressions into a smooth low-dimensional latent space. The latter can be efficiently explored with various optimization methods to address the task of symbolic regression. Indeed, random search through the latent space of HVAE performs better than random search through expressions generated by manually crafted probabilistic grammars for mathematical expressions. Finally, the EDHiE system for symbolic regression, which applies an evolutionary algorithm to the latent space of HVAE, reconstructs equations from a standard symbolic regression benchmark better than a state-of-the-art system based on a similar combination of deep learning and evolutionary algorithms.
Submitted 10 September, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
P(Expression|Grammar): Probability of deriving an algebraic expression with a probabilistic context-free grammar
Authors:
Urh Primožič,
Ljupčo Todorovski,
Matej Petković
Abstract:
Probabilistic context-free grammars have a long record of use as generative models in machine learning and symbolic regression. When used for symbolic regression, they generate algebraic expressions. We define the latter as equivalence classes of strings derived by a grammar and address the problem of calculating the probability of deriving a given expression with a given grammar. We show that the problem is undecidable in general. We then present specific grammars for generating linear, polynomial, and rational expressions, where algorithms for calculating the probability of a given expression exist. For those grammars, we design algorithms for calculating the exact probability and efficient approximation with arbitrary precision.
Submitted 2 December, 2022; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Boosting the Performance of Quantum Annealers using Machine Learning
Authors:
Jure Brence,
Dragan Mihailović,
Viktor Kabanov,
Ljupčo Todorovski,
Sašo Džeroski,
Jaka Vodeb
Abstract:
Noisy intermediate-scale quantum (NISQ) devices are spearheading the second quantum revolution. Of these, quantum annealers are the only ones currently offering real world, commercial applications on as many as 5000 qubits. The size of problems that can be solved by quantum annealers is limited mainly by errors caused by environmental noise and intrinsic imperfections of the processor. We address the issue of intrinsic imperfections with a novel error correction approach, based on machine learning methods. Our approach adjusts the input Hamiltonian to maximize the probability of finding the solution. In our experiments, the proposed error correction method improved the performance of annealing by up to three orders of magnitude and enabled the solving of a previously intractable, maximally complex problem.
Submitted 7 March, 2022; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Predicting Hidden Links and Missing Nodes in Scale-Free Networks with Artificial Neural Networks
Authors:
Rakib Hassan Pran,
Ljupco Todorovski
Abstract:
Many real-life networks take the form of scale-free networks, such as the World Wide Web, protein-protein interaction networks, semantic networks, airline networks, interbank payment networks, etc. To analyze these networks, it is necessary to understand the properties of scale-free networks. Using these properties, we can identify any type of anomaly in those networks. In this research, we propose a methodology, in the form of an algorithm, to predict hidden links and missing nodes in scale-free networks. We combine a generator of random networks, as a source of training data, with artificial neural networks for supervised classification. We aim to train the neural networks to discriminate between different subtypes of scale-free networks and to predict the missing nodes and hidden links among (present and missing) nodes in a given scale-free network. We chose Bela Bollobas's directed scale-free random graph generation algorithm as the generator of random networks to produce a large set of scale-free network data.
Submitted 25 September, 2021;
originally announced September 2021.
-
Explaining the Performance of Multi-label Classification Methods with Data Set Properties
Authors:
Jasmin Bogatinovski,
Ljupčo Todorovski,
Sašo Džeroski,
Dragi Kocev
Abstract:
Meta learning generalizes the empirical experience with different learning tasks and holds promise for providing important empirical insight into the behaviour of machine learning algorithms. In this paper, we present a comprehensive meta-learning study of data sets and methods for multi-label classification (MLC). MLC is a practically relevant machine learning task where each example is labelled with multiple labels simultaneously. Here, we analyze 40 MLC data sets by using 50 meta features describing different properties of the data. The main findings of this study are as follows. First, the most prominent meta features that describe the space of MLC data sets are the ones assessing different aspects of the label space. Second, the meta models show that the most important meta features describe the label space, and the meta features describing the relationships among the labels tend to occur a bit more often than the meta features describing the distributions between and within the individual labels. Third, the optimization of the hyperparameters can improve the predictive performance; however, the extent of the improvements often does not justify the resource utilization.
Submitted 28 June, 2021;
originally announced June 2021.
-
Comprehensive Comparative Study of Multi-Label Classification Methods
Authors:
Jasmin Bogatinovski,
Ljupčo Todorovski,
Sašo Džeroski,
Dragi Kocev
Abstract:
Multi-label classification (MLC) has recently received increasing interest from the machine learning community. Several studies provide reviews of methods and datasets for MLC and a few provide empirical comparisons of MLC methods. However, they are limited in the number of methods and datasets considered. This work provides a comprehensive empirical study of a wide range of MLC methods on a plethora of datasets from various domains. More specifically, our study evaluates 26 methods on 42 benchmark datasets using 20 evaluation measures. The adopted evaluation methodology adheres to the highest literature standards for designing and executing large scale, time-budgeted experimental studies. First, the methods are selected based on their usage by the community, assuring representation of methods across the MLC taxonomy of methods and different base learners. Second, the datasets cover a wide range of complexity and domains of application. The selected evaluation measures assess the predictive performance and the efficiency of the methods. The results of the analysis identify RFPCT, RFDTBR, ECCJ48, EBRJ48 and AdaBoostMH as best performing methods across the spectrum of performance measures. Whenever a new method is introduced, it should be compared to different subsets of MLC methods, determined on the basis of the different evaluation criteria.
Submitted 16 February, 2021; v1 submitted 14 February, 2021;
originally announced February 2021.
-
Probabilistic Grammars for Equation Discovery
Authors:
Jure Brence,
Ljupčo Todorovski,
Sašo Džeroski
Abstract:
Equation discovery, also known as symbolic regression, is a type of automated modeling that discovers scientific laws, expressed in the form of equations, from observed data and expert knowledge. Deterministic grammars, such as context-free grammars, have been used to limit the search spaces in equation discovery by providing hard constraints that specify which equations to consider and which not. In this paper, we propose the use of probabilistic context-free grammars in equation discovery. Such grammars encode soft constraints, specifying a prior probability distribution on the space of possible equations. We show that probabilistic grammars can be used to elegantly and flexibly formulate the parsimony principle, that favors simpler equations, through probabilities attached to the rules in the grammars. We demonstrate that the use of probabilistic, rather than deterministic grammars, in the context of a Monte-Carlo algorithm for grammar-based equation discovery, leads to more efficient equation discovery. Finally, by specifying prior probability distributions over equation spaces, the foundations are laid for Bayesian approaches to equation discovery.
Submitted 22 March, 2021; v1 submitted 1 December, 2020;
originally announced December 2020.
-
Equation Discovery for Nonlinear System Identification
Authors:
Nikola Simidjievski,
Ljupčo Todorovski,
Juš Kocijan,
Sašo Džeroski
Abstract:
Equation discovery methods enable modelers to combine domain-specific knowledge and system identification to construct models most suitable for a selected modeling task. The method described and evaluated in this paper can be used as a nonlinear system identification method for gray-box modeling. It consists of two interlaced, computer-aided parts of modeling. The first performs computer-aided identification of a model structure composed of elements selected from user-specified domain-specific modeling knowledge, while the second part performs parameter estimation. In this paper, recent developments of the equation discovery method called process-based modeling, suited for nonlinear system identification, are elaborated and illustrated on two continuous-time case studies. The first case study illustrates the use of process-based modeling on synthetic data, while the second case study evaluates it on measured data for a standard system-identification benchmark. The experimental results clearly demonstrate the ability of process-based modeling to reconstruct both model structure and parameters from measured data.
Submitted 1 July, 2019;
originally announced July 2019.
-
Meta-Model Framework for Surrogate-Based Parameter Estimation in Dynamical Systems
Authors:
Žiga Lukšič,
Jovan Tanevski,
Sašo Džeroski,
Ljupčo Todorovski
Abstract:
The central task in modeling complex dynamical systems is parameter estimation. This task involves numerous evaluations of a computationally expensive objective function. Surrogate-based optimization introduces a computationally efficient predictive model that approximates the value of the objective function. The standard approach involves learning a surrogate from training examples that correspond to past evaluations of the objective function. Current surrogate-based optimization methods use static, predefined substitution strategies that decide when to use the surrogate and when to use the true objective. We introduce a meta-model framework where the substitution strategy is dynamically adapted to the solution space of the given optimization problem. The meta model encapsulates the objective function, the surrogate model and the model of the substitution strategy, as well as components for learning them. The framework can be seamlessly coupled with an arbitrary optimization algorithm without any modification: it replaces the objective function and autonomously decides how to evaluate a given candidate solution. We test the utility of the framework on three tasks of estimating parameters of real-world models of dynamical systems. The results show that the meta model significantly improves the efficiency of optimization, reducing the total number of evaluations of the objective function by up to 77% on average.
Submitted 18 December, 2019; v1 submitted 21 June, 2019;
originally announced June 2019.
-
Reconstructing dynamical networks via feature ranking
Authors:
Marc G. Leguia,
Zoran Levnajic,
Ljupco Todorovski,
Bernard Zenko
Abstract:
Empirical data on real complex systems are becoming increasingly available. Parallel to this is the need for new methods of reconstructing (inferring) the topology of networks from time-resolved observations of their node-dynamics. The methods based on physical insights often rely on strong assumptions about the properties and dynamics of the scrutinized network. Here, we use the insights from machine learning to design a new method of network reconstruction that essentially makes no such assumptions. Specifically, we interpret the available trajectories (data) as features, and use two independent feature ranking approaches, Random forest and RReliefF, to rank the importance of each node for predicting the value of each other node, which yields the reconstructed adjacency matrix. We show that our method is fairly robust to coupling strength, system size, trajectory length and noise. We also find that the reconstruction quality strongly depends on the dynamical regime.
Submitted 26 August, 2019; v1 submitted 11 February, 2019;
originally announced February 2019.
-
Decoupling approximation robustly reconstructs directed dynamical networks
Authors:
Nikola Simidjievski,
Jovan Tanevski,
Bernard Zenko,
Zoran Levnajic,
Ljupco Todorovski,
Saso Dzeroski
Abstract:
Methods for reconstructing the topology of complex networks from time-resolved observations of node dynamics are gaining relevance across scientific disciplines. Of greatest practical interest are methods that make no assumptions about properties of the dynamics, and can cope with noisy, short and incomplete trajectories. Ideal reconstruction in such a scenario requires an exhaustive approach of simulating the dynamics for all possible network configurations and matching the simulated against the actual trajectories, which of course is computationally too costly for any realistic application. Relying on insights from equation discovery and machine learning, we here introduce the decoupling approximation of dynamical networks and propose a new reconstruction method based on it. The decoupling approximation consists of matching the simulated against the actual trajectories for each node individually rather than for the entire network at once. Despite the drastic reduction of the computational cost that this approximation entails, we find our method's performance to be very close to that of the ideal method. In particular, we not only make no assumptions about properties of the trajectories, but provide strong evidence that our method's performance is largely independent of the dynamical regime at hand. Of crucial relevance for practical applications, we also find our method to be extremely robust to both length and resolution of the trajectories and relatively insensitive to noise.
Submitted 7 November, 2018; v1 submitted 8 December, 2017;
originally announced December 2017.
-
The Influence of Feature Representation of Text on the Performance of Document Classification
Authors:
Sanda Martinčić-Ipšić,
Tanja Miličić,
Ljupčo Todorovski
Abstract:
In this paper we perform a comparative analysis of three models for feature representation of text documents in the context of document classification. In particular, we consider the most often used family of models, bag-of-words; the recently proposed continuous-space models word2vec and doc2vec; and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models for the same task has not been well understood. This is especially true for the network-based model, which has rarely been considered for representation of text documents for classification. In this study, we measure the performance of the document classifiers trained using the method of random forests on features generated by the three models and their variants. The results of the empirical comparison show that the commonly used bag-of-words model has performance comparable to the one obtained by the emerging continuous-space model of doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. Finally, the results point out that doc2vec shows superior performance in the tasks of classifying large documents.
Submitted 5 July, 2017;
originally announced July 2017.