-
For high-dimensional hierarchical models, consider exchangeability of effects across covariates instead of across datasets
Authors:
Brian L. Trippe,
Hilary K. Finucane,
Tamara Broderick
Abstract:
Hierarchical Bayesian methods enable information sharing across multiple related regression problems. While standard practice is to model regression parameters (effects) as (1) exchangeable across datasets and (2) correlated to differing degrees across covariates, we show that this approach exhibits poor statistical performance when the number of covariates exceeds the number of datasets. For instance, in statistical genetics, we might regress dozens of traits (defining datasets) for thousands of individuals (responses) on up to millions of genetic variants (covariates). When an analyst has more covariates than datasets, we argue that it is often more natural to instead model effects as (1) exchangeable across covariates and (2) correlated to differing degrees across datasets. To this end, we propose a hierarchical model expressing our alternative perspective. We devise an empirical Bayes estimator for learning the degree of correlation between datasets. We develop theory that demonstrates that our method outperforms the classic approach when the number of covariates dominates the number of datasets, and corroborate this result empirically on several high-dimensional multiple regression and classification problems.
Submitted 13 July, 2021;
originally announced July 2021.
-
Measuring dependence powerfully and equitably
Authors:
Yakir A. Reshef,
David N. Reshef,
Hilary K. Finucane,
Pardis C. Sabeti,
Michael M. Mitzenmacher
Abstract:
Given a high-dimensional data set we often wish to find the strongest relationships within it. A common strategy is to evaluate a measure of dependence on every variable pair and retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used is equitable [Reshef et al. 2015a], i.e., if, for some measure of noise, it assigns similar scores to equally noisy relationships regardless of relationship type (e.g., linear, exponential, periodic).
In this paper, we introduce and characterize a population measure of dependence called MIC*. We show three ways that MIC* can be viewed: as the population value of MIC, a highly equitable statistic from [Reshef et al. 2011], as a canonical "smoothing" of mutual information, and as the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions of the marginals of the joint distribution. Based on this theory, we introduce an efficient approach for computing MIC* from the density of a pair of random variables, and we define a new consistent estimator MICe for MIC* that is efficiently computable. In contrast, there is no known polynomial-time algorithm for computing the original equitable statistic MIC. We show through simulations that MICe has better bias-variance properties than MIC. We then introduce and prove the consistency of a second statistic, TICe, that is a trivial side-product of the computation of MICe and whose goal is powerful independence testing rather than equitability.
We show in simulations that MICe and TICe have good equitability and power against independence respectively. The analyses here complement a more in-depth empirical evaluation of several leading measures of dependence [Reshef et al. 2015b] that shows state-of-the-art performance for MICe and TICe.
Submitted 30 August, 2021; v1 submitted 8 May, 2015;
originally announced May 2015.
-
Algebraically recurrent random walks on groups
Authors:
Itai Benjamini,
Hilary Finucane,
Romain Tessera
Abstract:
Initial steps are presented towards understanding which finitely generated groups are almost surely generated as semigroups by the path of a random walk on the group.
Submitted 24 December, 2012; v1 submitted 16 July, 2012;
originally announced July 2012.
-
On the scaling limit of finite vertex transitive graphs with large diameter
Authors:
Itai Benjamini,
Hilary Finucane,
Romain Tessera
Abstract:
Let $(X_n)$ be an unbounded sequence of finite, connected, vertex transitive graphs such that $|X_n| = o(\mathrm{diam}(X_n)^q)$ for some $q>0$. We show that up to taking a subsequence, and after rescaling by the diameter, the sequence $(X_n)$ converges in the Gromov-Hausdorff distance to a torus of dimension $<q$, equipped with some invariant Finsler metric. The proof relies on a recent quantitative version of Gromov's theorem on groups with polynomial growth obtained by Breuillard, Green and Tao. If $X_n$ is only roughly transitive and $|X_n| = o\bigl(\mathrm{diam}(X_n)^{\delta}\bigr)$ for $\delta > 1$ sufficiently small, we prove, this time by elementary means, that $(X_n)$ converges to a circle.
Submitted 26 August, 2014; v1 submitted 26 March, 2012;
originally announced March 2012.
-
A recursive construction of t-wise uniform permutations
Authors:
Hilary Finucane,
Ron Peled,
Yariv Yaari
Abstract:
We present a recursive construction of a (2t + 1)-wise uniform set of permutations on 2n objects using a (2t + 1) - (2n, n, \cdot) combinatorial design, a t-wise uniform set of permutations on n objects and a (2t+1)-wise uniform set of permutations on n objects. Using the complete design in this procedure gives a t-wise uniform set of permutations on n objects whose size is at most t^2n, the first non-trivial construction of an infinite family of t-wise uniform sets for t \geq 4. If a non-trivial design with suitable parameters is found, it will imply a corresponding improvement in the construction.
Submitted 4 November, 2012; v1 submitted 24 January, 2012;
originally announced January 2012.
-
Finite Voronoi decompositions of infinite vertex transitive graphs
Authors:
Hilary Finucane
Abstract:
In this paper, we consider the Voronoi decompositions of an arbitrary infinite vertex-transitive graph G. In particular, we are interested in the following question: what is the largest number of Voronoi cells that must be infinite, given sufficiently (but finitely) many Voronoi sites which are sufficiently far from each other? We call this number the survival number s(G).
The survival number of a graph has an alternative characterization in terms of covering, which we use to show that s(G) is always at least two. The survival number is not a quasi-isometry invariant, but it remains open whether finiteness of s(G) is. We show that all vertex transitive graphs with polynomial growth have a finite s(G); vertex transitive graphs with infinitely many ends have an infinite s(G); the lamplighter graph LL(Z), which has exponential growth, has a finite s(G); and the lamplighter graph LL(Z^2), which is Liouville, has an infinite s(G).
Submitted 2 November, 2011;
originally announced November 2011.
-
Scenery Reconstruction on Finite Abelian Groups
Authors:
Hilary Finucane,
Omer Tamuz,
Yariv Yaari
Abstract:
We consider the question of when a random walk on a finite abelian group with a given step distribution can be used to reconstruct a binary labeling of the elements of the group, up to a shift. Matzinger and Lember (2006) give a sufficient condition for reconstructibility on cycles. While, as we show, this condition is not in general necessary, our main result is that it is necessary when the length of the cycle is prime and larger than 5, and the step distribution has only rational probabilities. We extend this result to other abelian groups.
Submitted 30 April, 2014; v1 submitted 27 May, 2011;
originally announced May 2011.
-
Comparing Pedigree Graphs
Authors:
Bonnie Kirkpatrick,
Yakir Reshef,
Hilary Finucane,
Haitao Jiang,
Binhai Zhu,
Richard M. Karp
Abstract:
Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree.
In this paper, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present 1) several algorithms that are fast and exact in various special cases, and 2) a general, randomized heuristic algorithm.
In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees.
We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.
Submitted 18 October, 2011; v1 submitted 5 September, 2010;
originally announced September 2010.