-
Making mathematical online resources FAIR: at the example of small phylogenetic trees
Authors:
Tabea Bacher,
Marina Garrote-López,
Christiane Görgen,
Marius J. Neubert
Abstract:
We report on the process of taking an early-2000s mathematical library, the Small Phylogenetic Trees, and transforming it into a FAIR, modern, and sustainable repository for data from algebraic phylogenetics. This process is based on a three-fold strategy: (1) writing a software package which enables the user to reproduce results of the database; (2) setting up a user-friendly new website with cross links to theoretical publications, code snippets, and serialized output of computations; and (3) all the while documenting the steps we take in order to derive lessons learned which may be generalised to other such projects. This paper addresses (3); (1) is found at https://docs.oscar-system.org/dev/Experimental/AlgebraicStatistics/phylogenetics, and (2) is located at https://algebraicphylogenetics.org.
Submitted 4 July, 2025; v1 submitted 18 January, 2025;
originally announced January 2025.
-
Cumulant Tensors in Partitioned Independent Component Analysis
Authors:
Marina Garrote-López,
Monroe Stephenson
Abstract:
In this work, we explore Partitioned Independent Component Analysis (PICA), an extension of the well-established Independent Component Analysis (ICA) framework. Traditionally, ICA focuses on extracting a vector of independent source signals from a linear combination of them defined by a mixing matrix. We aim to provide a comprehensive understanding of the identifiability of this mixing matrix in ICA. Significant to our investigation, recent developments by Mesters and Zwiernik relax these strict independence requirements, studying the identifiability of the mixing matrix from zero restrictions on cumulant tensors. In this paper, we assume alternative independence conditions, in particular, the PICA case, where only partitions of the sources are mutually independent. We study this case from an algebraic perspective, and our primary result generalizes previous results on the identifiability of the mixing matrix.
Submitted 15 February, 2024;
originally announced February 2024.
-
Identifiability of Level-1 Species Networks from Gene Tree Quartets
Authors:
Elizabeth S. Allman,
Hector Baños,
Marina Garrote-Lopez,
John A. Rhodes
Abstract:
When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors -- the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network.
Submitted 11 January, 2024;
originally announced January 2024.
-
Algebraic optimization of sequential decision problems
Authors:
Mareike Dressler,
Marina Garrote-López,
Guido Montúfar,
Johannes Müller,
Kemal Rose
Abstract:
We study the optimization of the expected long-term reward in finite partially observable Markov decision processes over the set of stationary stochastic policies. In the case of deterministic observations, also known as state aggregation, the problem is equivalent to optimizing a linear objective subject to quadratic constraints. We characterize the feasible set of this problem as the intersection of a product of affine varieties of rank one matrices and a polytope. Based on this description, we obtain bounds on the number of critical points of the optimization problem. Finally, we conduct experiments in which we solve the KKT equations or the Lagrange equations over different boundary components of the feasible set, and compare the result to the theoretical bounds and to other constrained optimization methods.
Submitted 17 November, 2022;
originally announced November 2022.
-
Computing algebraic degrees of phylogenetic varieties
Authors:
Luis David Garcia Puente,
Marina Garrote-López,
Elima Shehu
Abstract:
A phylogenetic variety is an algebraic variety parameterized by a statistical model of the evolution of biological sequences along a tree. Understanding this variety is an important problem in the area of algebraic statistics with applications in phylogeny reconstruction. In the broader area of algebraic statistics, there have been important theoretical advances in computing certain invariants associated to algebraic varieties arising in applications. Beyond the dimension and degree of a variety, one is interested in computing other algebraic degrees, such as the maximum likelihood degree and the Euclidean distance degree. Despite these efforts, the current literature lacks explicit computations of these invariants for the particular case of phylogenetic varieties. In our work, we fill this gap by computing these invariants for phylogenetic varieties arising from the simplest group-based models of nucleotide substitution (the Cavender-Farris-Neyman, Jukes-Cantor, Kimura 2-parameter, and Kimura 3-parameter models) on small phylogenetic trees with at most 5 leaves.
Submitted 9 February, 2024; v1 submitted 5 October, 2022;
originally announced October 2022.
-
Designing weights for quartet-based methods when data is heterogeneous across lineages
Authors:
Marta Casanellas,
Jesús Fernández-Sánchez,
Marina Garrote-López,
Marc Sabaté-Vidales
Abstract:
Homogeneity across lineages is a common assumption in phylogenetics according to which nucleotide substitution rates remain constant in time and do not depend on lineages. This is a simplifying hypothesis which is often adopted to make the process of sequence evolution more tractable. However, its validity has been explored and put into question in several papers. On the other hand, dealing successfully with the general case (heterogeneity across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools.
The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, and thus especially suited to data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch length estimated with the paralinear distance. ASAQ is statistically consistent when applied to GM data, considers rate and base composition heterogeneity among lineages, and assumes neither stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely, Quartet Puzzling, Weight Optimization and Wilson's method) in combination with ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support Weight Optimization with ASAQ weights as a reliable and successful reconstruction method.
Submitted 27 February, 2022;
originally announced February 2022.
-
Robust estimation of tree structured models
Authors:
Marta Casanellas,
Marina Garrote-López,
Piotr Zwiernik
Abstract:
Consider the problem of learning undirected graphical models on trees from corrupted data. Recently Katiyar et al. showed that it is possible to recover trees from noisy binary data up to a small equivalence class of possible trees. Their other paper on the Gaussian case follows a similar pattern. By framing this as a special phylogenetic recovery problem we largely generalize these two settings. Using the framework of linear latent tree models we discuss tree identifiability for binary data under a continuous corruption model. For the Ising and the Gaussian tree model we also provide a characterisation of when the Chow-Liu algorithm consistently learns the underlying tree from the noisy data.
Submitted 10 February, 2021;
originally announced February 2021.
-
SAQ: semi-algebraic quartet reconstruction method
Authors:
Marta Casanellas,
Jesús Fernández-Sánchez,
Marina Garrote-López
Abstract:
We present the phylogenetic quartet reconstruction method SAQ (Semi-algebraic quartet reconstruction). SAQ is consistent with the most general Markov model of nucleotide substitution and, in particular, it allows for rate heterogeneity across lineages. Based on the algebraic and semi-algebraic description of distributions that arise from the general Markov model on a quartet, the method outputs normalized weights for the three trivalent quartets (which can be used as input for quartet-based methods). We show that SAQ is a highly competitive method that outperforms most of the well-known reconstruction methods on data simulated under the general Markov model on 4-taxon trees. Moreover, it also achieves a high performance on data that violates the underlying assumptions.
Submitted 27 November, 2020;
originally announced November 2020.
-
Distance to the stochastic part of phylogenetic varieties
Authors:
Marta Casanellas,
Jesús Fernández-Sánchez,
Marina Garrote-López
Abstract:
Modelling the substitution of nucleotides along a phylogenetic tree is usually done by a hidden Markov process. This allows one to define a distribution of characters at the leaves of the tree, and one might be able to obtain polynomial relationships among the probabilities of different characters. The study of these polynomials and the geometry of the algebraic varieties defined by them can be used to reconstruct phylogenetic trees. However, not all points in these algebraic varieties make biological sense. In this paper, we explore the extent to which adding semi-algebraic conditions arising from the restriction to parameters with statistical meaning can improve existing methods of phylogenetic reconstruction. To this end, our aim is to compute the distance of data points to algebraic varieties and to the stochastic part of these varieties. Computing these distances involves optimization by nonlinear programming algorithms. We use analytical methods to find some of these distances for quartet trees evolving under the Kimura 3-parameter or the Jukes-Cantor models. Numerical algebraic geometry and computational algebra also play a fundamental role in this paper.
Submitted 9 October, 2020; v1 submitted 4 December, 2019;
originally announced December 2019.