-
Minority Representation in Network Rankings: Methods for Estimation, Testing, and Fairness
Authors:
Hui Shen,
Peter W. MacDonald,
Eric D. Kolaczyk
Abstract:
Networks, composed of nodes and their connections, are widely used to model complex relationships across various fields. Centrality metrics often inform decisions such as identifying key nodes or prioritizing resources. However, networks frequently suffer from missing or incorrect edges, which can systematically bias centrality-based decisions and distort the representation of certain protected groups. To address this issue, we introduce a formal definition of minority representation, measured as the proportion of minority nodes among the top-ranked nodes. We model systematic bias against minority groups by using group-dependent missing edge errors. We propose methods to estimate and detect systematic bias. Asymptotic limits of minority representation statistics are derived under canonical network models and used to correct representation of minority groups in node rankings. Simulation results demonstrate the effectiveness of our estimation, testing, and ranking correction procedures, and we apply our methods to a contact network, showcasing their practical applicability.
Submitted 1 July, 2025;
originally announced July 2025.
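The abstract above defines minority representation as the proportion of minority nodes among the top-ranked nodes. A minimal sketch of that quantity follows, assuming degree as the centrality metric and a deterministic tie-breaking rule; the function names, the toy graph, and the choice of dropped edge are all hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def degree_scores(nodes: Iterable[Hashable], edges: Iterable[Edge]) -> Dict[Hashable, int]:
    """Degree of each node in an undirected graph (assumed centrality metric)."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def minority_representation(scores: Dict[Hashable, float],
                            minority: Set[Hashable], k: int) -> float:
    """Proportion of minority nodes among the k top-ranked nodes."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of nodes")
    # Ties are broken by node label; this rule is an assumption of the sketch.
    ranked = sorted(scores, key=lambda v: (-scores[v], str(v)))
    return sum(v in minority for v in ranked[:k]) / k


nodes = range(6)
minority = {2, 5}
true_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
# Hypothetical observed network: one edge touching a minority node is missing.
observed_edges = [e for e in true_edges if e != (1, 2)]

rep_true = minority_representation(degree_scores(nodes, true_edges), minority, k=3)
rep_observed = minority_representation(degree_scores(nodes, observed_edges), minority, k=3)
print(rep_true, rep_observed)
```

In this toy graph a single missing edge incident to a minority node is enough to push that node out of the top three, which is the kind of distortion the group-dependent missing-edge model is meant to capture.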
-
Mesoscale two-sample testing for network data
Authors:
Peter W. MacDonald,
Elizaveta Levina,
Ji Zhu
Abstract:
Networks arise naturally in many scientific fields as a representation of pairwise connections. Statistical network analysis has most often considered a single large network, but it is common in a number of applications, for example, neuroimaging, to observe multiple networks on a shared node set. When these networks are grouped by case-control status or another categorical covariate, the classical statistical question of two-sample comparison arises. In this work, we address the problem of testing for statistically significant differences in a given arbitrary subset of connections. This general framework allows an analyst to focus on a single node, a specific region of interest, or compare whole networks. Our ability to conduct "mesoscale" testing on a meaningful group of edges is particularly relevant for applications such as neuroimaging and distinguishes our approach from prior work, which tends to focus either on a single node or the whole network. In this mesoscale setting, we develop statistically sound projection-based tests for two-sample comparison in both weighted and binary edge networks. Our approach can leverage all available network information, and learn informative projections which improve testing power when low-dimensional latent network structure is present.
Submitted 22 October, 2024;
originally announced October 2024.
-
Autoregressive Networks with Dependent Edges
Authors:
Jinyuan Chang,
Qin Fang,
Eric D. Kolaczyk,
Peter W. MacDonald,
Qiwei Yao
Abstract:
We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses models which accommodate, for example, transitivity, density dependence, and other stylized features often observed in real network data. By assuming the edges of the network at each time are independent conditionally on their lagged values, the models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and maximum likelihood estimation in a straightforward manner. Due to the possibly large number of parameters in the models, the initial MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed, using an iteration based on a projection which mitigates the impact of the other parameters (Chang et al., 2021, 2023). Based on a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the stationarity assumption. The limiting distribution is not normal in general, and it reduces to normal when the underlying process satisfies some mixing conditions. The approach is illustrated with a transitivity model on both simulated data and a real network data set.
Submitted 24 April, 2024;
originally announced April 2024.
-
Latent process models for functional network data
Authors:
Peter W. MacDonald,
Elizaveta Levina,
Ji Zhu
Abstract:
Network data are often sampled with auxiliary information or collected through the observation of a complex system over time, leading to multiple network snapshots indexed by a continuous variable. Many methods in statistical network analysis are traditionally designed for a single network, and can be applied to an aggregated network in this setting, but that approach can miss important functional structure. Here we develop an approach to estimating the expected network explicitly as a function of a continuous index, be it time or another indexing variable. We parameterize the network expectation through low dimensional latent processes, whose components we represent with a fixed, finite-dimensional functional basis. We derive a gradient descent estimation algorithm, establish theoretical guarantees for recovery of the low dimensional structure, compare our method to competitors, and apply it to a data set of international political interactions over time, showing our proposed method to adapt well to data, outperform competitors, and provide interpretable and meaningful results.
Submitted 15 July, 2024; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Approximate Post-Selective Inference for Regression with the Group LASSO
Authors:
Snigdha Panigrahi,
Peter W. MacDonald,
Daniel Kessler
Abstract:
After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), inference for the selected parameters is unreliable in the absence of adjustments for selection bias. In the penalized Gaussian regression setup, existing approaches provide adjustments for selection events that can be expressed as linear inequalities in the data variables. Such a representation, however, fails to hold for selection with the Group LASSO and substantially obstructs the scope of subsequent post-selective inference. Key questions of inferential interest -- for example, inference for the effects of selected variables on the outcome -- remain unanswered. In the present paper, we develop a consistent, post-selective, Bayesian method to address the existing gaps by deriving a likelihood adjustment factor and an approximation thereof that eliminates bias from the selection of groups. Experiments on simulated data and data from the Human Connectome Project demonstrate that our method recovers the effects of parameters within the selected groups while paying only a small price for bias adjustment.
Submitted 13 August, 2022; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Latent space models for multiplex networks with shared structure
Authors:
Peter W. MacDonald,
Elizaveta Levina,
Ji Zhu
Abstract:
Latent space models are frequently used for modeling single-layer networks and include many popular special cases, such as the stochastic block model and the random dot product graph. However, they are not well-developed for more complex network structures, which are becoming increasingly common in practice. Here we propose a new latent space model for multiplex networks: multiple, heterogeneous networks observed on a shared node set. Multiplex networks can represent a network sample with shared node labels, a network evolving over time, or a network with multiple types of edges. The key feature of our model is that it learns from data how much of the network structure is shared between layers and pools information across layers as appropriate. We establish identifiability, develop a fitting procedure using convex optimization in combination with a nuclear norm penalty, and prove a guarantee of recovery for the latent positions as long as there is sufficient separation between the shared and the individual latent subspaces. We compare the model to competing methods in the literature on simulated networks and on a multiplex network describing the worldwide trade of agricultural products.
Submitted 7 July, 2021; v1 submitted 28 December, 2020;
originally announced December 2020.
-
Dynamic adaptive procedures that control the false discovery rate
Authors:
Peter MacDonald,
Kun Liang,
Arnold Janssen
Abstract:
In the multiple testing problem with independent tests, the classical linear step-up procedure controls the false discovery rate (FDR) at level $π_0α$, where $π_0$ is the proportion of true null hypotheses and $α$ is the target FDR level. Adaptive procedures can improve power by incorporating estimates of $π_0$, which typically rely on a tuning parameter. Fixed adaptive procedures set their tuning parameters before seeing the data and can be shown to control the FDR in finite samples. We develop theoretical results for dynamic adaptive procedures whose tuning parameters are determined by the data. We show that, if the tuning parameter is chosen according to a left-to-right stopping time rule, the corresponding dynamic adaptive procedure controls the FDR in finite samples. Examples include the recently proposed right-boundary procedure and the widely used lowest-slope procedure, among others. Simulation results show that the right-boundary procedure is more powerful than other dynamic adaptive procedures under independence and mild dependence conditions.
Submitted 28 August, 2019; v1 submitted 6 December, 2017;
originally announced December 2017.
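The abstract above contrasts the classical linear step-up procedure with adaptive procedures that plug in an estimate of $π_0$ depending on a tuning parameter. A minimal sketch of both follows, assuming a Storey-type estimator with a fixed tuning parameter `lam` chosen before seeing the data; this is one common choice used here for illustration, not necessarily the estimator studied in the paper. The dynamic (data-driven) choices of tuning parameter discussed in the abstract are not implemented, and all function names and p-values are hypothetical.

```python
from typing import List, Sequence


def linear_step_up(pvals: Sequence[float], alpha: float) -> List[int]:
    """Linear step-up procedure: indices of rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    # Find the largest rank whose ordered p-value falls under rank * alpha / m.
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])


def storey_pi0(pvals: Sequence[float], lam: float) -> float:
    """Storey-type estimate of the null proportion (assumed estimator)."""
    m = len(pvals)
    if m == 0 or not 0 <= lam < 1:
        raise ValueError("need at least one p-value and 0 <= lam < 1")
    return (sum(p > lam for p in pvals) + 1) / (m * (1 - lam))


def fixed_adaptive_step_up(pvals: Sequence[float], alpha: float,
                           lam: float = 0.5) -> List[int]:
    """Step-up procedure run at level alpha / pi0_hat, with lam fixed in advance."""
    return linear_step_up(pvals, alpha / storey_pi0(pvals, lam))


pvals = [0.001, 0.008, 0.012, 0.02, 0.04, 0.3, 0.6, 0.9]
print(linear_step_up(pvals, alpha=0.05))          # rejects the 4 smallest
print(fixed_adaptive_step_up(pvals, alpha=0.05))  # rejects the 5 smallest
```

With these illustrative p-values the estimate of $π_0$ is 0.75, so the adaptive procedure runs at a larger effective level and rejects one more hypothesis than the plain step-up procedure.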
-
A Remark on Baserunning risk: Waiting Can Cost You the Game
Authors:
Peter MacDonald,
Dan McQuillan,
Ian McQuillan
Abstract:
We address the value of a baserunner at first base waiting to see if a ball in play falls in for a hit, before running. When a ball is hit in the air, the baserunner will usually wait, to gather additional information as to whether a ball will fall for a hit before deciding to run aggressively. This additional information guarantees that there will not be a double play and an "unnecessary out". However, waiting could potentially cost the runner the opportunity to reach third base, or even to score on the play, if the ball falls for a hit. This in turn affects the probability of scoring at least one run henceforth in the inning. We create a new statistic, the baserunning risk threshold (BRT), which measures the minimum probability with which the baserunner should be sure that a ball in play will fall in for a hit, before running without waiting to see if the ball will be caught, with the goal of scoring at least one run in the inning. We measure a 0-out and a 1-out version of BRT, both in aggregate, and also in high leverage situations, where scoring one run is particularly important. We show a drop in BRT for pitchers who pitch in more high leverage innings, and a very low BRT on average for "elite closers". It follows that baserunners should be frequently running without waiting, and getting thrown out in double plays regularly to maximize their chances of scoring at least one run.
Submitted 3 May, 2015;
originally announced May 2015.