-
Causal Models for Growing Networks
Authors:
Gecia Bravo-Hermsdorff,
Lee M. Gunderson,
Kayvan Sadeghi
Abstract:
Real-world networks grow over time; statistical models based on node exchangeability are not appropriate. Instead of constraining the structure of the \textit{distribution} of edges, we propose that the relevant symmetries refer to the \textit{causal structure} between them. We first enumerate the 96 causal directed acyclic graph (DAG) models over pairs of nodes (dyad variables) in a growing network with finite ancestral sets that are invariant to node deletion. We then partition them into 21 classes with ancestral sets that are closed under node marginalization. Several of these classes are remarkably amenable to distributed and asynchronous evaluation. As an example, we highlight a simple model that exhibits flexible power-law degree distributions and emergent phase transitions in sparsity, which we characterize analytically. With few parameters and much conditional independence, our proposed framework provides natural baseline models for causal inference in relational data.
Submitted 1 April, 2025;
originally announced April 2025.
-
BudgetIV: Optimal Partial Identification of Causal Effects with Mostly Invalid Instruments
Authors:
Jordan Penn,
Lee M. Gunderson,
Gecia Bravo-Hermsdorff,
Ricardo Silva,
David S. Watson
Abstract:
Instrumental variables (IVs) are widely used to estimate causal effects in the presence of unobserved confounding between exposure and outcome. An IV must affect the outcome exclusively through the exposure and be unconfounded with the outcome. We present a framework for relaxing either or both of these strong assumptions with tuneable and interpretable budget constraints. Our algorithm returns a feasible set of causal effects that can be identified exactly given relevant covariance parameters. The feasible set may be disconnected but is a finite union of convex subsets. We discuss conditions under which this set is sharp, i.e., contains all and only effects consistent with the background assumptions and the joint distribution of observable variables. Our method applies to a wide class of semiparametric models, and we demonstrate how its ability to select specific subsets of instruments confers an advantage over convex relaxations in both linear and nonlinear settings. We also adapt our algorithm to form confidence sets that are asymptotically valid under a common statistical assumption from the Mendelian randomization literature.
Submitted 16 March, 2025; v1 submitted 11 November, 2024;
originally announced November 2024.
-
Bounding Causal Effects with Leaky Instruments
Authors:
David S. Watson,
Jordan Penn,
Lee M. Gunderson,
Gecia Bravo-Hermsdorff,
Afsaneh Mastouri,
Ricardo Silva
Abstract:
Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical approaches rely on strong assumptions such as the $\textit{exclusion criterion}$, which states that instrumental effects must be entirely mediated by treatments. This assumption often fails in practice. When IV methods are improperly applied to data that do not meet the exclusion criterion, estimated causal effects may be badly biased. In this work, we propose a novel solution that provides $\textit{partial}$ identification in linear systems given a set of $\textit{leaky instruments}$, which are allowed to violate the exclusion criterion to some limited degree. We derive a convex optimization objective that provides provably sharp bounds on the average treatment effect under some common forms of information leakage, and implement inference procedures to quantify the uncertainty of resulting estimates. We demonstrate our method in a set of experiments with simulated data, where it performs favorably against the state of the art. An accompanying $\texttt{R}$ package, $\texttt{leakyIV}$, is available from $\texttt{CRAN}$.
Submitted 8 May, 2024; v1 submitted 5 April, 2024;
originally announced April 2024.
-
Quantifying Human Priors over Social and Navigation Networks
Authors:
Gecia Bravo-Hermsdorff
Abstract:
Human knowledge is largely implicit and relational -- do we have a friend in common? can I walk from here to there? In this work, we leverage the combinatorial structure of graphs to quantify human priors over such relational data. Our experiments focus on two domains that have been continuously relevant over evolutionary timescales: social interaction and spatial navigation. We find that some features of the inferred priors are remarkably consistent, such as the tendency for sparsity as a function of graph size. Other features are domain-specific, such as the propensity for triadic closure in social interactions. More broadly, our work demonstrates how nonclassical statistical analysis of indirect behavioral experiments can be used to efficiently model latent biases in the data.
Submitted 28 February, 2024;
originally announced February 2024.
-
The Graph Pencil Method: Mapping Subgraph Densities to Stochastic Block Models
Authors:
Lee M. Gunderson,
Gecia Bravo-Hermsdorff,
Peter Orbanz
Abstract:
In this work, we describe a method that determines an exact map from a finite set of subgraph densities to the parameters of a stochastic block model (SBM) matching these densities. Given a number $K$ of blocks, the subgraph densities of a finite number of stars and bistars uniquely determine a single element of the class of all degree-separated stochastic block models with $K$ blocks. Our method makes it possible to translate estimates of these subgraph densities into model parameters, and hence to use subgraph densities directly for inference. The computational overhead is negligible; computing the translation map is polynomial in $K$, but independent of the graph size once the subgraph densities are given.
Submitted 31 January, 2024;
originally announced February 2024.
-
Intervention Generalization: A View from Factor Graph Models
Authors:
Gecia Bravo-Hermsdorff,
David S. Watson,
Jialin Yu,
Jakob Zeitler,
Ricardo Silva
Abstract:
One of the goals of causal inference is to generalize from past experiments and observational data to novel conditions. While it is in principle possible to eventually learn a mapping from a novel experimental condition to an outcome of interest, provided a sufficient variety of experiments is available in the training data, coping with a large combinatorial space of possible interventions is hard. Under a typical sparse experimental design, this mapping is ill-posed without relying on heavy regularization or prior distributions. Such assumptions may or may not be reliable, and can be hard to defend or test. In this paper, we take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system, communicated in the well-understood language of factor graph models. A postulated $\textit{interventional factor model}$ (IFM) may not always be informative, but it conveniently abstracts away a need for explicitly modeling unmeasured confounding and feedback mechanisms, leading to directly testable claims. Given an IFM and datasets from a collection of experimental regimes, we derive conditions for identifiability of the expected outcomes of new regimes never observed in these training data. We implement our framework using several efficient algorithms, and apply them to a range of semi-synthetic experiments.
Submitted 8 November, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Private and Communication-Efficient Algorithms for Entropy Estimation
Authors:
Gecia Bravo-Hermsdorff,
Róbert Busa-Fekete,
Mohammad Ghavamzadeh,
Andres Muñoz Medina,
Umar Syed
Abstract:
Modern statistical estimation is often performed in a distributed setting where each sample belongs to a single user who shares their data with a central server. Users are typically concerned with preserving the privacy of their samples, and also with minimizing the amount of data they must transmit to the server. We give improved private and communication-efficient algorithms for estimating several popular measures of the entropy of a distribution. All of our algorithms have constant communication cost and satisfy local differential privacy. For a joint distribution over many variables whose conditional independence is given by a tree, we describe algorithms for estimating Shannon entropy that require a number of samples that is linear in the number of variables, compared to the quadratic sample complexity of prior work. We also describe an algorithm for estimating Gini entropy whose sample complexity has no dependence on the support size of the distribution and can be implemented using a single round of concurrent communication between the users and the server. In contrast, the previously best-known algorithm has high communication cost and requires the server to facilitate interaction between the users. Finally, we describe an algorithm for estimating collision entropy that generalizes the best known algorithm to the private and communication-efficient setting.
Submitted 12 May, 2023;
originally announced May 2023.
-
Statistical anonymity: Quantifying reidentification risks without reidentifying users
Authors:
Gecia Bravo-Hermsdorff,
Róbert Busa-Fekete,
Lee M. Gunderson,
Andrés Muñoz Medina,
Umar Syed
Abstract:
Data anonymization is an approach to privacy-preserving data release aimed at preventing participant reidentification, and it is an important alternative to differential privacy in applications that cannot tolerate noisy data. Existing algorithms for enforcing $k$-anonymity in the released data assume that the curator performing the anonymization has complete access to the original data. Reasons for limiting this access range from undesirability to complete infeasibility. This paper explores ideas -- objectives, metrics, protocols, and extensions -- for reducing the trust that must be placed in the curator, while still maintaining a statistical notion of $k$-anonymity. We suggest trust (amount of information provided to the curator) and privacy (anonymity of the participants) as the primary objectives of such a framework. We describe a class of protocols aimed at achieving these goals, proposing new metrics of privacy in the process, and proving related bounds. We conclude by discussing a natural extension of this work that completely removes the need for a central curator.
Submitted 28 January, 2022;
originally announced January 2022.
-
Quantifying Network Similarity using Graph Cumulants
Authors:
Gecia Bravo-Hermsdorff,
Lee M. Gunderson,
Pierre-André Maugis,
Carey E. Priebe
Abstract:
How might one test the hypothesis that networks were sampled from the same distribution? Here, we compare two statistical tests that use subgraph counts to address this question. The first uses the empirical subgraph densities themselves as estimates of those of the underlying distribution. The second test uses a new approach that converts these subgraph densities into estimates of the \textit{graph cumulants} of the distribution (without any increase in computational complexity). We demonstrate -- via theory, simulation, and application to real data -- the superior statistical power of using graph cumulants. In summary, when analyzing data using subgraph/motif densities, we suggest using the corresponding graph cumulants instead.
Submitted 18 July, 2023; v1 submitted 23 July, 2021;
originally announced July 2021.
-
Gender and collaboration patterns in a temporal scientific authorship network
Authors:
Gecia Bravo-Hermsdorff,
Valkyrie Felso,
Emily Ray,
Lee M. Gunderson,
Mary E. Helander,
Joana Maria,
Yael Niv
Abstract:
One can point to a variety of historical milestones for gender equality in STEM (science, technology, engineering, and mathematics); however, practical effects are incremental and ongoing. It is important to quantify gender differences in subdomains of scientific work in order to detect potential biases and monitor progress. In this work, we study the relevance of gender in scientific collaboration patterns in the Institute for Operations Research and the Management Sciences (INFORMS), a professional society with sixteen peer-reviewed journals. Using their publication data from 1952 to 2016, we constructed a large temporal bipartite network between authors and publications, and augmented the author nodes with gender labels. We characterized differences in several basic statistics of this network over time, highlighting how they have changed with respect to relevant historical events. We find a steady increase in participation by women (e.g., fraction of authorships by women and of new women authors) starting around 1980. However, women still comprise less than 25% of the INFORMS society and an even smaller fraction of authors with many publications. Moreover, we describe a methodology for quantifying the structural role of an authorship with respect to the overall connectivity of the network, using it to measure subtle differences between authorships by women and by men. Specifically, as measures of structural importance of an authorship, we use effective resistance and contraction importance, two measures related to diffusion throughout a network. As a null model, we propose a degree-preserving temporal and geometric network model with emergent communities. Our results suggest the presence of systematic differences between the collaboration patterns of men and women that cannot be explained by only local statistics.
Submitted 27 May, 2020;
originally announced May 2020.
-
Introducing Graph Cumulants: What is the Variance of Your Social Network?
Authors:
Lee M. Gunderson,
Gecia Bravo-Hermsdorff
Abstract:
In an increasingly interconnected world, understanding and summarizing the structure of networks becomes increasingly relevant. However, this task is nontrivial; proposed summary statistics are as diverse as the networks they describe, and a standardized hierarchy has not yet been established. In contrast, vector-valued random variables admit such a description in terms of their cumulants (e.g., mean, (co)variance, skew, kurtosis). Here, we introduce the natural analogue of cumulants for networks, building a hierarchical description based on correlations between an increasing number of connections, seamlessly incorporating additional information, such as directed edges, node attributes, and edge weights. These graph cumulants provide a principled and unifying framework for quantifying the propensity of a network to display any substructure of interest (such as cliques to measure clustering). Moreover, they give rise to a natural hierarchical family of maximum entropy models for networks (i.e., ERGMs) that do not suffer from the "degeneracy problem", a common practical pitfall of other ERGMs.
Submitted 14 April, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
A Unifying Framework for Spectrum-Preserving Graph Sparsification and Coarsening
Authors:
Gecia Bravo-Hermsdorff,
Lee M. Gunderson
Abstract:
How might one "reduce" a graph? That is, generate a smaller graph that preserves the global structure at the expense of discarding local details? There has been extensive work on both graph sparsification (removing edges) and graph coarsening (merging nodes, often by edge contraction); however, these operations are currently treated separately. Interestingly, for a planar graph, edge deletion corresponds to edge contraction in its planar dual (and more generally, for a graphical matroid and its dual). Moreover, with respect to the dynamics induced by the graph Laplacian (e.g., diffusion), deletion and contraction are physical manifestations of two reciprocal limits: edge weights of $0$ and $\infty$, respectively. In this work, we provide a unifying framework that captures both of these operations, allowing one to simultaneously sparsify and coarsen a graph while preserving its large-scale structure. The limit of infinite edge weight is rarely considered, as many classical notions of graph similarity diverge. However, its algebraic, geometric, and physical interpretations are reflected in the Laplacian pseudoinverse $\mathbf{\mathit{L}}^{\dagger}$, which remains finite in this limit. Motivated by this insight, we provide a probabilistic algorithm that reduces graphs while preserving $\mathbf{\mathit{L}}^{\dagger}$, using an unbiased procedure that minimizes its variance. We compare our algorithm with several existing sparsification and coarsening algorithms using real-world datasets, and demonstrate that it more accurately preserves the large-scale structure.
Submitted 16 February, 2020; v1 submitted 25 February, 2019;
originally announced February 2019.