Making mathematical online resources FAIR: at the example of small phylogenetic trees

Authors: Tabea Bacher, Marina Garrote-López, Christiane Görgen, Marius J. Neubert

Abstract: We report on the process of taking an early-2000s mathematical library, the Small Phylogenetic Trees, and transforming it into a FAIR, modern, and sustainable repository for data from algebraic phylogenetics. This process is based on a three-fold strategy: (1) writing a software package which enables the user to reproduce results of the database; (2) setting up a user-friendly new website with cross-links to theoretical publications, code snippets, and serialized output of computations; and (3) all the while documenting the steps we take in order to derive lessons learned which may be generalised to other such projects. This paper addresses (3). Part (1) is found at https://docs.oscar-system.org/dev/Experimental/AlgebraicStatistics/phylogenetics, and part (2) is located at https://algebraicphylogenetics.org.

Submitted 4 July, 2025; v1 submitted 18 January, 2025; originally announced January 2025.

Comments: 13 pages, 4 figures, Version 0.1 of https://algebraicphylogenetics.org has been launched

arXiv:2402.10089 [pdf, ps, other]

Cumulant Tensors in Partitioned Independent Component Analysis

Authors: Marina Garrote-López, Monroe Stephenson

Abstract: In this work, we explore Partitioned Independent Component Analysis (PICA), an extension of the well-established Independent Component Analysis (ICA) framework. Traditionally, ICA focuses on extracting a vector of independent source signals from a linear combination of them defined by a mixing matrix. We aim to provide a comprehensive understanding of the identifiability of this mixing matrix in ICA. Significant to our investigation, recent developments by Mesters and Zwiernik relax these strict independence requirements, studying the identifiability of the mixing matrix from zero restrictions on cumulant tensors. In this paper, we assume alternative independence conditions, in particular, the PICA case, where only partitions of the sources are mutually independent. We study this case from an algebraic perspective, and our primary result generalizes previous results on the identifiability of the mixing matrix.

Submitted 15 February, 2024; originally announced February 2024.

MSC Class: 15A69; 62R01; 62H25

arXiv:2401.06290 [pdf, other]

Identifiability of Level-1 Species Networks from Gene Tree Quartets

Authors: Elizabeth S. Allman, Hector Baños, Marina Garrote-Lopez, John A. Rhodes

Abstract: When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors -- the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network.

Submitted 11 January, 2024; originally announced January 2024.

Comments: 27 pages + 11 pages Supplementary Materials

MSC Class: 92D15

arXiv:2211.09439 [pdf, other]

Algebraic optimization of sequential decision problems

Authors: Mareike Dressler, Marina Garrote-López, Guido Montúfar, Johannes Müller, Kemal Rose

Abstract: We study the optimization of the expected long-term reward in finite partially observable Markov decision processes over the set of stationary stochastic policies. In the case of deterministic observations, also known as state aggregation, the problem is equivalent to optimizing a linear objective subject to quadratic constraints. We characterize the feasible set of this problem as the intersection of a product of affine varieties of rank one matrices and a polytope. Based on this description, we obtain bounds on the number of critical points of the optimization problem. Finally, we conduct experiments in which we solve the KKT equations or the Lagrange equations over different boundary components of the feasible set, and compare the result to the theoretical bounds and to other constrained optimization methods.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 19 pages, 3 figures

MSC Class: 62R01; 90C23; 90C40

arXiv:2210.02116 [pdf, ps, other]

doi 10.2140/astat.2023.14.215

Computing algebraic degrees of phylogenetic varieties

Authors: Luis David Garcia Puente, Marina Garrote-López, Elima Shehu

Abstract: A phylogenetic variety is an algebraic variety parameterized by a statistical model of the evolution of biological sequences along a tree. Understanding this variety is an important problem in the area of algebraic statistics with applications in phylogeny reconstruction. In the broader area of algebraic statistics, there have been important theoretical advances in computing certain invariants associated to algebraic varieties arising in applications. Beyond the dimension and degree of a variety, one is interested in computing other algebraic degrees, such as the maximum likelihood degree and the Euclidean distance degree. Despite these efforts, the current literature lacks explicit computations of these invariants for the particular case of phylogenetic varieties. In our work, we fill this gap by computing these invariants for phylogenetic varieties arising from the simplest group-based models of nucleotide substitution (the Cavender-Farris-Neyman model, the Jukes-Cantor model, the Kimura 2-parameter model, and the Kimura 3-parameter model) on small phylogenetic trees with at most 5 leaves.

Submitted 9 February, 2024; v1 submitted 5 October, 2022; originally announced October 2022.

Journal ref: Alg. Stat. 14 (2023) 215-231

arXiv:2202.13365 [pdf, other]

Designing weights for quartet-based methods when data is heterogeneous across lineages

Authors: Marta Casanellas, Jesús Fernández-Sánchez, Marina Garrote-López, Marc Sabaté-Vidales

Abstract: Homogeneity across lineages is a common assumption in phylogenetics according to which nucleotide substitution rates remain constant in time and do not depend on lineages. This is a simplifying hypothesis which is often adopted to make the process of sequence evolution more tractable. However, its validity has been explored and put into question in several papers. On the other hand, dealing successfully with the general case (heterogeneity across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, and thus especially suited to data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch length estimated with the paralinear distance. ASAQ is statistically consistent when applied to GM data, considers rate and base composition heterogeneity among lineages, and assumes neither stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely, Quartet Puzzling, Weight Optimization and Wilson's method) in combination with ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support Weight Optimization with ASAQ weights as a reliable and successful reconstruction method.

Submitted 27 February, 2022; originally announced February 2022.

Comments: Main article: 25 pages, 6 figures, 4 tables; 1 appendix: 22 pages, 11 figures, 7 tables

MSC Class: 92D15

arXiv:2102.05472 [pdf, other]

Robust estimation of tree structured models

Authors: Marta Casanellas, Marina Garrote-López, Piotr Zwiernik

Abstract: Consider the problem of learning undirected graphical models on trees from corrupted data. Recently Katiyar et al. showed that it is possible to recover trees from noisy binary data up to a small equivalence class of possible trees. Their other paper on the Gaussian case follows a similar pattern. By framing this as a special phylogenetic recovery problem we largely generalize these two settings. Using the framework of linear latent tree models we discuss tree identifiability for binary data under a continuous corruption model. For the Ising and the Gaussian tree model we also provide a characterisation of when the Chow-Liu algorithm consistently learns the underlying tree from the noisy data.

Submitted 10 February, 2021; originally announced February 2021.

MSC Class: 62H22; 62R01; 60J20

arXiv:2011.13968 [pdf, other]

SAQ: semi-algebraic quartet reconstruction method

Authors: Marta Casanellas, Jesús Fernández-Sánchez, Marina Garrote-López

Abstract: We present the phylogenetic quartet reconstruction method SAQ (Semi-algebraic quartet reconstruction). SAQ is consistent with the most general Markov model of nucleotide substitution and, in particular, it allows for rate heterogeneity across lineages. Based on the algebraic and semi-algebraic description of distributions that arise from the general Markov model on a quartet, the method outputs normalized weights for the three trivalent quartets (which can be used as input of quartet-based methods). We show that SAQ is a highly competitive method that outperforms most of the well-known reconstruction methods on data simulated under the general Markov model on 4-taxon trees. Moreover, it also achieves a high performance on data that violates the underlying assumptions.

Submitted 27 November, 2020; originally announced November 2020.

arXiv:1912.02138 [pdf, other]

doi 10.1016/j.jsc.2020.09.003

Distance to the stochastic part of phylogenetic varieties

Authors: Marta Casanellas, Jesús Fernández-Sánchez, Marina Garrote-López

Abstract: Modelling the substitution of nucleotides along a phylogenetic tree is usually done by a hidden Markov process. This allows one to define a distribution of characters at the leaves of the trees, and one might be able to obtain polynomial relationships among the probabilities of different characters. The study of these polynomials and the geometry of the algebraic varieties defined by them can be used to reconstruct phylogenetic trees. However, not all points in these algebraic varieties make biological sense. In this paper, we explore the extent to which adding semi-algebraic conditions arising from the restriction to parameters with statistical meaning can improve existing methods of phylogenetic reconstruction. To this end, our aim is to compute the distance of data points to algebraic varieties and to the stochastic part of these varieties. Computing these distances involves optimization by nonlinear programming algorithms. We use analytical methods to find some of these distances for quartet trees evolving under the Kimura 3-parameter or the Jukes-Cantor models. Numerical algebraic geometry and computational algebra also play a fundamental role in this paper.

Submitted 9 October, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: 33 pages; 11 figures; to appear in Journal of Symbolic Computation

MSC Class: 14P10; 14Q30; 62R01

Showing 1–9 of 9 results for author: Garrote-López, M