-
Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization
Authors:
Kiyoung Om,
Kyuil Sim,
Taeyoung Yun,
Hyeongyu Kang,
Jinkyoo Park
Abstract:
Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. While Bayesian optimization (BO) methods have been developed to solve such problems, they often struggle with the curse of dimensionality. Recently, generative model-based approaches have emerged as a promising alternative for constrained optimization. However, they suffer from poor scalability and are vulnerable to mode collapse, particularly when the target distribution is highly multi-modal. In this paper, we propose a new framework to overcome these challenges. Our method iterates through two stages. First, we train flow-based models to capture the data distribution and surrogate models that predict both function values and constraint violations with uncertainty quantification. Second, we cast the candidate selection problem as a posterior inference problem to effectively search for promising candidates that have high objective values while not violating the constraints. During posterior inference, we find that the posterior distribution is highly multi-modal and has a large plateau due to constraints, especially when constraint feedback is given as binary indicators of feasibility. To mitigate this issue, we amortize the sampling from the posterior distribution in the latent space of flow-based models, which is much smoother than that in the data space. We empirically demonstrate that our method achieves superior performance on various synthetic and real-world constrained black-box optimization tasks. Our code is publicly available at https://github.com/umkiyoung/CiBO.
Submitted 1 July, 2025;
originally announced July 2025.
-
Learning curves theory for hierarchically compositional data with power-law distributed features
Authors:
Francesco Cagnetta,
Hyunmo Kang,
Matthieu Wyart
Abstract:
Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.
Submitted 11 May, 2025;
originally announced May 2025.
-
Graph-based Semi-supervised and Unsupervised Methods for Local Clustering
Authors:
Zhaiming Shen,
Sung Ha Kang
Abstract:
Local clustering aims to identify specific substructures within a large graph without requiring full knowledge of the entire graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state-of-the-art results in the low-label-rate regime.
Submitted 27 April, 2025;
originally announced April 2025.
-
Transfer Learning Between U.S. Presidential Elections: How Should We Learn From A 2020 Ad Campaign To Inform 2024 Ad Campaigns?
Authors:
Xinran Miao,
Jiwei Zhao,
Hyunseung Kang
Abstract:
For the 2024 U.S. presidential election, would negative, digital ads against Donald Trump impact voter turnout in Pennsylvania (PA), a key "tipping point" state? The gold standard to address this question, a randomized experiment where voters get randomized to different ads, yields unbiased estimates of the ad effect, but is very expensive. Instead, we propose a less-than-ideal, but significantly cheaper and faster framework based on transfer learning, where we transfer knowledge from a past ad experiment in 2020 to evaluate ads for 2024. A key component of our framework is a sensitivity analysis that quantifies the unobservable differences between 2020 and 2024 elections, where sensitivity parameters can be calibrated in a data-driven manner. We propose two estimators of the 2024 ad effect: a simple regression estimator with bootstrap, which we recommend for practitioners in this field, and an estimator based on the efficient influence function for broader applications. Using our framework, we estimate the effect of running a negative, digital ad campaign against Trump on voter turnout in PA for the 2024 election. Our findings indicate effect heterogeneity across counties of PA and among important subgroups stratified by gender, urbanicity, and educational attainment.
Submitted 12 March, 2025; v1 submitted 1 November, 2024;
originally announced November 2024.
-
Efficient estimation of semiparametric spatial point processes with V-fold random thinning
Authors:
Xindi Lin,
Hyunseung Kang
Abstract:
We study a broad class of models called semiparametric spatial point processes where the intensity function contains both a parametric component and a nonparametric component. We propose a novel estimator of the parametric component based on random thinning, a common sampling technique in point processes. The proposed estimator of the parametric component is shown to be consistent and asymptotically normal if the nonparametric component can be estimated at the desired rate. We then extend a popular kernel-based estimator in i.i.d. settings and establish convergence rates that will enable inference for the parametric component. Next, we generalize the notion of semiparametric efficiency lower bound in i.i.d. settings to spatial point processes and show that the proposed estimator achieves the efficiency lower bound if the process is Poisson. Computationally, we show how to efficiently evaluate the proposed estimator with existing software for generalized partial linear models in i.i.d. settings by tailoring the sampling weights to replicate the dependence induced by the point process. We conclude with a small simulation study and a re-analysis of the spatial distribution of rainforest trees.
Submitted 17 April, 2025; v1 submitted 6 October, 2024;
originally announced October 2024.
-
An Adaptive Importance Sampling for Locally Stable Point Processes
Authors:
Hee-Geon Kang,
Sunggon Kim
Abstract:
The problem of finding the expected value of a statistic of a locally stable point process in a bounded region is addressed. We propose an adaptive importance sampling scheme for solving the problem. In our proposal, we restrict the importance point process to the family of homogeneous Poisson point processes, which enables us to quickly generate independent samples of the importance point process. The optimal intensity of the importance point process is found by applying the cross-entropy minimization method. In the proposed scheme, the expected value of the function and the optimal intensity are iteratively estimated in an adaptive manner. We show that the proposed estimator converges to the target value almost surely, and prove its asymptotic normality. We explain how to apply the proposed scheme to the estimation of the intensity of a stationary pairwise interaction point process. The performance of the proposed scheme is compared numerically with Markov chain Monte Carlo simulation and perfect sampling.
Submitted 1 March, 2025; v1 submitted 14 August, 2024;
originally announced August 2024.
-
Identification and Inference with Invalid Instruments
Authors:
Hyunseung Kang,
Zijian Guo,
Zhonghua Liu,
Dylan Small
Abstract:
Instrumental variables (IVs) are widely used to study the causal effect of an exposure on an outcome in the presence of unmeasured confounding. IVs require an instrument, a variable that is (A1) associated with the exposure, (A2) has no direct effect on the outcome except through the exposure, and (A3) is not related to unmeasured confounders. Unfortunately, finding variables that satisfy conditions (A2) or (A3) can be challenging in practice. This paper reviews works where instruments may not satisfy conditions (A2) or (A3), which we refer to as invalid instruments. We review identification and inference under different violations of (A2) or (A3), specifically under linear models, non-linear models, and heteroskedastic models. We conclude with an empirical comparison of various methods by re-analyzing the effect of body mass index on systolic blood pressure from the UK Biobank.
Submitted 28 July, 2024;
originally announced July 2024.
-
Generalized Rosenbaum Bounds Sensitivity Analysis for Matched Observational Studies with Treatment Doses: Sufficiency, Consistency, and Efficiency
Authors:
Siyu Heng,
Hyunseung Kang
Abstract:
In matched observational studies with binary treatments, the Rosenbaum bounds framework is arguably the most widely used sensitivity analysis framework for assessing sensitivity to unobserved covariates. Unlike the binary treatment case, although widely needed in practice, sensitivity analysis for matched observational studies with treatment doses (i.e., non-binary treatments such as ordinal treatments or continuous treatments) still lacks solid foundations and valid methodologies. We fill in this blank by establishing theoretical foundations and novel methodologies under a generalized Rosenbaum bounds sensitivity analysis framework. First, we present a criterion for assessing the validity of sensitivity analysis in matched observational studies with treatment doses and use that criterion to justify the necessity of incorporating the treatment dose information into sensitivity analysis through generalized Rosenbaum sensitivity bounds. We also generalize Rosenbaum's classic sensitivity parameter $Γ$ to the non-binary treatment case and prove its sufficiency. Second, we study the asymptotic properties of sensitivity analysis by generalizing Rosenbaum's classic design sensitivity and Bahadur efficiency for testing Fisher's sharp null to the non-binary treatment case and deriving novel formulas for them. Our theoretical results showed the importance of appropriately incorporating the treatment dose into a test, which is an intrinsic distinction with the binary treatment case. Third, for testing Neyman's weak null (i.e., null sample average treatment effect), we propose the first valid sensitivity analysis procedure for matching with treatment dose through generalizing an existing optimization-based sensitivity analysis for the binary treatment case, built on the generalized Rosenbaum sensitivity bounds and large-scale mixed integer programming.
Submitted 23 March, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
A More Robust Approach to Multivariable Mendelian Randomization
Authors:
Yinxiang Wu,
Hyunseung Kang,
Ting Ye
Abstract:
Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effects of multiple exposures on an outcome. However, unlike univariable Mendelian randomization, MVMR often faces greater challenges with many weak instruments, which can lead to bias not necessarily toward zero and inflation of type I errors. In this work, we introduce a new asymptotic regime that allows exposures to have varying degrees of instrument strength, providing a more accurate theoretical framework for studying MVMR estimators. Under this regime, our analysis of the widely used multivariable inverse-variance weighted method shows that it is often biased and tends to produce misleadingly narrow confidence intervals in the presence of many weak instruments. To address this, we propose a simple, closed-form modification to the multivariable inverse-variance weighted estimator to reduce bias from weak instruments, and additionally introduce a novel spectral regularization technique to improve finite-sample performance. We show that the resulting spectral-regularized estimator remains consistent and asymptotically normal under many weak instruments. Through simulations and real data applications, we demonstrate that our proposed estimator and asymptotic framework can enhance the robustness of MVMR analyses.
Submitted 12 June, 2025; v1 submitted 31 January, 2024;
originally announced February 2024.
-
A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology
Authors:
Changwoo J. Lee,
Elaine Symanski,
Amal Rammah,
Dong Hun Kang,
Philip K. Hopke,
Eun Sug Park
Abstract:
Accounting for exposure measurement errors has been recognized as a crucial problem in environmental epidemiology for over two decades. Bayesian hierarchical models offer a coherent probabilistic framework for evaluating associations between environmental exposures and health effects, which take into account exposure measurement errors introduced by uncertainty in the estimated exposure as well as spatial misalignment between the exposure and health outcome data. While two-stage Bayesian analyses are often regarded as a good alternative to fully Bayesian analyses when joint estimation is not feasible, there has been minimal research on how to properly propagate uncertainty from the first-stage exposure model to the second-stage health model, especially in the case of a large number of participant locations along with spatially correlated exposures. We propose a scalable two-stage Bayesian approach, called a sparse multivariate normal (sparse MVN) prior approach, based on the Vecchia approximation for assessing associations between exposure and health outcomes in environmental epidemiology. We compare its performance with existing approaches through simulation. Our sparse MVN prior approach shows comparable performance with the fully Bayesian approach, which is a gold standard but is impossible to implement in some cases. We investigate the association between source-specific exposures and pollutant (nitrogen dioxide (NO$_2$))-specific exposures and birth outcomes for 2012 in Harris County, Texas, using several approaches, including the newly developed method.
Submitted 13 January, 2024; v1 submitted 31 December, 2023;
originally announced January 2024.
-
On the Temperature of Bayesian Graph Neural Networks for Conformal Prediction
Authors:
Seohyeon Cha,
Honggu Kang,
Joonhyuk Kang
Abstract:
Accurate uncertainty quantification in graph neural networks (GNNs) is essential, especially in high-stakes domains where GNNs are frequently employed. Conformal prediction (CP) offers a promising framework for quantifying uncertainty by providing $\textit{valid}$ prediction sets for any black-box model. CP ensures formal probabilistic guarantees that a prediction set contains a true label with a desired probability. However, the size of prediction sets, known as $\textit{inefficiency}$, is influenced by the underlying model and data generating process. On the other hand, Bayesian learning also provides a credible region based on the estimated posterior distribution, but this region is $\textit{well-calibrated}$ only when the model is correctly specified. Building on a recent work that introduced a scaling parameter for constructing valid credible regions from posterior estimates, our study explores the advantages of incorporating a temperature parameter into Bayesian GNNs within the CP framework. We empirically demonstrate the existence of temperatures that result in more efficient prediction sets. Furthermore, we conduct an analysis to identify the factors contributing to inefficiency and offer valuable insights into the relationship between CP performance and model calibration.
Submitted 3 December, 2023; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Fully Latent Principal Stratification With Measurement Models
Authors:
Sooyong Lee,
Adam C Sales,
Hyeon-Ah Kang,
Tiffany A. Whittaker
Abstract:
There is wide agreement on the importance of implementation data from randomized effectiveness studies in behavioral science; however, there are few methods available to incorporate these data into causal models, especially when they are multivariate or longitudinal, and interest is in low-dimensional summaries. We introduce a framework for studying how treatment effects vary between subjects who implement an intervention differently, combining principal stratification with latent variable measurement models; since principal strata are latent in both treatment arms, we call it "fully-latent principal stratification" or FLPS. We describe FLPS models including item-response-theory measurement, show that they are feasible in a simulation study, and illustrate them in an analysis of hint usage from a randomized study of computerized mathematics tutors.
Submitted 15 May, 2024; v1 submitted 7 September, 2023;
originally announced September 2023.
-
Bayesian Causal Forests & the 2022 ACIC Data Challenge: Scalability and Sensitivity
Authors:
Ajinkya H. Kokandakar,
Hyunseung Kang,
Sameer K. Deshpande
Abstract:
We demonstrate how Hahn et al.'s Bayesian Causal Forests model (BCF) can be used to estimate conditional average treatment effects for the longitudinal dataset in the 2022 American Causal Inference Conference Data Challenge. Unfortunately, existing implementations of BCF do not scale to the size of the challenge data. Therefore, we developed flexBCF -- a more scalable and flexible implementation of BCF -- and used it in our challenge submission. We investigate the sensitivity of our results to the choice of propensity score estimation method and the use of sparsity-inducing regression tree priors. While we found that our overall point predictions were not especially sensitive to these modeling choices, we did observe that running BCF with flexibly estimated propensity scores often yielded better-calibrated uncertainty intervals.
Submitted 11 May, 2023; v1 submitted 3 November, 2022;
originally announced November 2022.
-
Propensity Score Modeling: Key Challenges When Moving Beyond the No-Interference Assumption
Authors:
Hyunseung Kang,
Chan Park,
Ralph Trane
Abstract:
The paper presents some models for the propensity score. Considerable attention is given to a recently popular, but relatively under-explored setting in causal inference where the no-interference assumption does not hold. We lay out some key challenges in propensity score modeling under interference and present a few promising models based on existing works on mixed effects models.
Submitted 12 August, 2022;
originally announced August 2022.
-
Semiparametric Efficient Dimension Reduction in multivariate regression with an Inner Envelope
Authors:
Linquan Ma,
Hyunseung Kang,
Lan Liu
Abstract:
Recently, Su and Cook proposed a dimension reduction technique called the inner envelope which can be substantially more efficient than the original envelope or existing dimension reduction techniques for multivariate regression. However, their technique relied on a linear model with normally distributed error, which may be violated in practice. In this work, we propose a semiparametric variant of the inner envelope that relies on neither the linear model nor the normality assumption. We show that our proposal leads to globally and locally efficient estimators of the inner envelope spaces. We also present a computationally tractable algorithm to estimate the inner envelope. Our simulations and real data analysis show that our method is both robust and efficient compared to existing dimension reduction methods in a diverse array of settings.
Submitted 23 May, 2022;
originally announced May 2022.
-
A Robust, Differentially Private Randomized Experiment for Evaluating Online Educational Programs With Sensitive Student Data
Authors:
Manjusha Kancharla,
Hyunseung Kang
Abstract:
Randomized control trials (RCTs) have been the gold standard to evaluate the effectiveness of a program, policy, or treatment on an outcome of interest. However, many RCTs assume that study participants are willing to share their (potentially sensitive) data, specifically their response to treatment. This assumption, while trivial at first, is becoming difficult to satisfy in the modern era, especially in online settings where there are more regulations to protect individuals' data. The paper presents a new, simple experimental design that is differentially private, one of the strongest notions of data privacy. Also, using works on noncompliance in experimental psychology, we show that our design is robust against "adversarial" participants who may distrust investigators with their personal data and provide contaminated responses to intentionally bias the results of the experiment. Under our new design, we propose unbiased and asymptotically Normal estimators for the average treatment effect. We also present a doubly robust, covariate-adjusted estimator that uses pre-treatment covariates (if available) to improve efficiency. We conclude by using the proposed experimental design to evaluate the effectiveness of online statistics courses at the University of Wisconsin-Madison during the Spring 2021 semester, where many classes were online due to COVID-19.
Submitted 4 December, 2021;
originally announced December 2021.
-
Minimum Resource Threshold Policy Under Partial Interference
Authors:
Chan Park,
Guanhua Chen,
Menggang Yu,
Hyunseung Kang
Abstract:
When developing policies for prevention of infectious diseases, policymakers often set specific, outcome-oriented targets to achieve. For example, when developing a vaccine allocation policy, policymakers may want to distribute them so that at least a certain fraction of individuals in a census block are disease-free and spillover effects due to interference within blocks are accounted for. The paper proposes methods to estimate a block-level treatment policy that achieves a pre-defined, outcome-oriented target while accounting for spillover effects due to interference. Our policy, the minimum resource threshold policy (MRTP), suggests the minimum fraction of treated units required within a block to meet or exceed the target level of the outcome. We estimate the MRTP from empirical risk minimization using a novel, nonparametric, doubly robust loss function. We then characterize statistical properties of the estimated MRTP in terms of the excess risk bound. We apply our methodology to design a water, sanitation, and hygiene allocation policy for Senegal with the goal of increasing the proportion of households with no children experiencing diarrhea to a level exceeding a specified threshold. Our policy outperforms competing policies and offers new approaches to design allocation policies, especially in international development for communicable diseases.
Submitted 23 October, 2023; v1 submitted 18 November, 2021;
originally announced November 2021.
-
A More Efficient, Doubly Robust, Nonparametric Estimator of Treatment Effects in Multilevel Studies
Authors:
Chan Park,
Hyunseung Kang
Abstract:
When studying treatment effects in multilevel studies, investigators commonly use (semi-)parametric estimators, which make strong parametric assumptions about the outcome, the treatment, and/or the correlation structure between study units in a cluster. We propose a novel estimator of treatment effects that does not make such assumptions. Specifically, the new estimator is shown to be doubly robust, asymptotically Normal, and often more efficient than existing estimators, all without having to make any parametric modeling assumptions about the outcome, the treatment, and the correlation structure. We achieve this by estimating two non-standard nuisance functions in causal inference, the conditional propensity score and the outcome covariance model, using existing machine learning methods designed for independent and identically distributed (i.i.d.) data. The new estimator is also demonstrated in simulated and real data where the new estimator is drastically more efficient than existing estimators, especially when studying cluster-specific treatment effects.
Submitted 10 May, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
Yield Spread Selection in Predicting Recession Probabilities: A Machine Learning Approach
Authors:
Jaehyuk Choi,
Desheng Ge,
Kyu Ho Kang,
Sungbin Sohn
Abstract:
The literature on using yield curves to forecast recessions customarily uses the 10-year--three-month Treasury yield spread without verifying the pair selection. This study investigates whether the predictive ability of the spread can be improved by letting a machine learning algorithm identify the best maturity pair and coefficients. Our comprehensive analysis shows that, despite the likelihood gain, the machine learning approach does not significantly improve prediction, owing to the estimation error. This is robust to the forecasting horizon, control variable, sample period, and oversampling of the recession observations. Our finding supports the use of the 10-year--three-month spread.
Submitted 5 January, 2022; v1 submitted 22 January, 2021;
originally announced January 2021.
-
Assumption-Lean Analysis of Cluster Randomized Trials in Infectious Diseases for Intent-to-Treat Effects and Network Effects
Authors:
Chan Park,
Hyunseung Kang
Abstract:
Cluster randomized trials (CRTs) are a popular design to study the effect of interventions in infectious disease settings. However, standard analysis of CRTs primarily relies on strong parametric methods, usually mixed-effect models to account for the clustering structure, and focuses on the overall intent-to-treat (ITT) effect to evaluate effectiveness. The paper presents two assumption-lean methods to analyze two types of effects in CRTs, ITT effects and network effects among well-known compliance groups. For the ITT effects, we study the overall and the heterogeneous ITT effects among the observed covariates where we do not impose parametric models or asymptotic restrictions on cluster size. For the network effects among compliance groups, we propose a new bound-based method that uses pre-treatment covariates, classification algorithms, and a linear program to obtain sharp bounds. A key feature of our method is that the bounds can become narrower as the classification algorithm improves and the method may also be useful for studies of partial identification with instrumental variables. We conclude by reanalyzing a CRT studying the effect of face masks and hand sanitizers on transmission of 2008 interpandemic influenza in Hong Kong.
Submitted 22 September, 2021; v1 submitted 27 December, 2020;
originally announced December 2020.
-
Two Robust Tools for Inference about Causal Effects with Invalid Instruments
Authors:
Hyunseung Kang,
Youjin Lee,
T. Tony Cai,
Dylan S. Small
Abstract:
Instrumental variables have been widely used to estimate the causal effect of a treatment on an outcome. Existing confidence intervals for causal effects based on instrumental variables assume that all of the putative instrumental variables are valid; a valid instrumental variable is a variable that affects the outcome only by affecting the treatment and is not related to unmeasured confounders. However, in practice, some of the putative instrumental variables are likely to be invalid. This paper presents two tools to conduct valid inference and tests in the presence of invalid instruments. First, we propose a simple and general approach to construct confidence intervals based on taking unions of well-known confidence intervals. Second, we propose a novel test for the null causal effect based on a collider bias. Our two proposals, especially when fused together, outperform traditional instrumental variable confidence intervals when invalid instruments are present, and can also be used as a sensitivity analysis when there is concern that instrumental variables assumptions are violated. The new approach is applied to a Mendelian randomization study on the causal effect of low-density lipoprotein on the incidence of cardiovascular diseases.
Submitted 2 June, 2020;
originally announced June 2020.
-
Efficient Semiparametric Estimation of Network Treatment Effects Under Partial Interference
Authors:
Chan Park,
Hyunseung Kang
Abstract:
Recently, many estimators for network treatment effects have been proposed. But, their optimality properties in terms of semiparametric efficiency have yet to be resolved. We present a simple, yet flexible asymptotic framework to derive the efficient influence function and the semiparametric efficiency lower bound for a family of network causal effects under partial interference. An important corollary of our results is that one of the existing estimators by Liu et al. (2019) is locally efficient. We also present other estimators that are efficient and discuss results on adaptive estimation. We conclude by using the efficient estimators to study the direct and spillover effects of conditional cash transfer programs in Colombia.
Submitted 24 November, 2021; v1 submitted 19 April, 2020;
originally announced April 2020.
-
Inferring Treatment Effects After Testing Instrument Strength in Linear Models
Authors:
Nan Bi,
Hyunseung Kang,
Jonathan Taylor
Abstract:
A common practice in IV studies is to check for instrument strength, i.e. its association to the treatment, with an F-test from regression. If the F-statistic is above some threshold, usually 10, the instrument is deemed to satisfy one of the three core IV assumptions and used to test for the treatment effect. However, in many cases, the inference on the treatment effect does not take into account the strength test done a priori. In this paper, we show that not accounting for this pretest can severely distort the distribution of the test statistic and propose a method to correct this distortion, producing valid inference. A key insight in our method is to frame the F-test as a randomized convex optimization problem and to leverage recent methods in selective inference. We prove that our method provides conditional and marginal Type I error control. We also extend our method to weak instrument settings. We conclude with a reanalysis of studies concerning the effect of education on earning where we show that not accounting for pre-testing can dramatically alter the original conclusion about education's effects.
Submitted 14 March, 2020;
originally announced March 2020.
-
ivmodel: An R Package for Inference and Sensitivity Analysis of Instrumental Variables Models with One Endogenous Variable
Authors:
Hyunseung Kang,
Yang Jiang,
Qingyuan Zhao,
Dylan S. Small
Abstract:
We present a comprehensive R software ivmodel for analyzing instrumental variables with one endogenous variable. The package implements a general class of estimators called k-class estimators and two confidence intervals that are fully robust to weak instruments. The package also provides power formulas for various test statistics in instrumental variables. Finally, the package contains methods for sensitivity analysis to examine the sensitivity of the inference to instrumental variables assumptions. We demonstrate the software on the data set from Card (1995), looking at the causal effect of levels of education on log earnings where the instrument is proximity to a four-year college.
Submitted 7 July, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Debiased Inverse-Variance Weighted Estimator in Two-Sample Summary-Data Mendelian Randomization
Authors:
Ting Ye,
Jun Shao,
Hyunseung Kang
Abstract:
Mendelian randomization (MR) has become a popular approach to study the effect of a modifiable exposure on an outcome by using genetic variants as instrumental variables. A challenge in MR is that each genetic variant explains a relatively small proportion of variance in the exposure and there are many such variants, a setting known as many weak instruments. To this end, we provide a theoretical characterization of the statistical properties of two popular estimators in MR, the inverse-variance weighted (IVW) estimator and the IVW estimator with screened instruments using an independent selection dataset, under many weak instruments. We then propose a debiased IVW estimator, a simple modification of the IVW estimator, that is robust to many weak instruments and doesn't require screening. Additionally, we present two instrument selection methods to improve the efficiency of the new estimator when a selection dataset is available. An extension of the debiased IVW estimator to handle balanced horizontal pleiotropy is also discussed. We conclude by demonstrating our results in simulated and real datasets.
Submitted 10 October, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.
-
Inference After Selecting Plausibly Valid Instruments with Application to Mendelian Randomization
Authors:
Nan Bi,
Hyunseung Kang,
Jonathan Taylor
Abstract:
Mendelian randomization (MR) is a popular method in genetic epidemiology to estimate the effect of an exposure on an outcome by using genetic instruments. These instruments are often selected from a combination of prior knowledge from genome wide association studies (GWAS) and data-driven instrument selection procedures or tests. Unfortunately, when testing for the exposure effect, the instrument selection process done a priori is not accounted for. This paper studies and highlights the bias resulting from not accounting for the instrument selection process by focusing on a recent data-driven instrument selection procedure, sisVIVE, as an example. We introduce a conditional inference approach that conditions on the instrument selection done a priori and leverage recent advances in selective inference to derive conditional null distributions of popular test statistics for the exposure effect in MR. The null distributions can be characterized with individual-level or summary-level data in MR. We show that our conditional confidence intervals derived from conditional null distributions attain the desired nominal level while typical confidence intervals computed in MR do not. We conclude by reanalyzing the effect of BMI on diastolic blood pressure using summary-level data from the UKBiobank that accounts for instrument selection.
Submitted 10 November, 2019;
originally announced November 2019.
-
Weak-Instrument Robust Tests in Two-Sample Summary-Data Mendelian Randomization
Authors:
Sheng Wang,
Hyunseung Kang
Abstract:
Mendelian randomization (MR) has been a popular method in genetic epidemiology to estimate the effect of an exposure on an outcome using genetic variants as instrumental variables (IV), with two-sample summary-data MR being the most popular. Unfortunately, instruments in MR studies are often weakly associated with the exposure, which can bias effect estimates and inflate Type I errors. In this work, we propose test statistics that are robust under weak instrument asymptotics by extending the Anderson-Rubin, Kleibergen, and the conditional likelihood ratio test in econometrics to two-sample summary-data MR. We also use the proposed Anderson-Rubin test to develop a point estimator and to detect invalid instruments. We conclude with a simulation and an empirical study and show that the proposed tests control size and have better power than existing methods with weak instruments.
Submitted 7 June, 2021; v1 submitted 15 September, 2019;
originally announced September 2019.
-
Mature GAIL: Imitation Learning for Low-level and High-dimensional Input using Global Encoder and Cost Transformation
Authors:
Wonsup Shin,
Hyolim Kang,
Sunghoon Hong
Abstract:
Recently, the GAIL framework and its variants have shown remarkable potential for solving practical MDP problems. However, low-level, high-dimensional state inputs in this framework, such as image sequences, have not been studied in detail. Furthermore, the cost function learned in the traditional GAIL framework lies only in a negative range, acting as a non-penalized reward and making it difficult for the agent to learn the optimal policy. In this paper, we propose a new algorithm based on the GAIL framework that includes a global encoder and a reward penalization mechanism. The global encoder solves two issues that arise when applying the GAIL framework to high-dimensional image states. We also show that the penalization mechanism provides a more adequate reward to the agent, resulting in stable performance improvement. Our approach is generally applicable to variants of the GAIL framework. We conducted in-depth experiments by applying our methods to various variants of the GAIL framework, and the results show that our method significantly improves performance on low-level, high-dimensional tasks.
Submitted 7 September, 2019;
originally announced September 2019.
-
A Groupwise Approach for Inferring Heterogeneous Treatment Effects in Causal Inference
Authors:
Chan Park,
Hyunseung Kang
Abstract:
Recently, there has been great interest in estimating the conditional average treatment effect using flexible machine learning methods. However, in practice, investigators often have working hypotheses about effect heterogeneity across pre-defined subgroups of study units, which we call the groupwise approach. The paper compares two modern ways to estimate groupwise treatment effects, a nonparametric approach and a semiparametric approach, with the goal of better informing practice. Specifically, we compare (a) the underlying assumptions, (b) efficiency and adaption to the underlying data generating models, and (c) a way to combine the two approaches. We also discuss how to test a key assumption concerning the semiparametric estimator and to obtain cluster-robust standard errors if study units in the same subgroups are correlated. We demonstrate our findings by conducting simulation studies and reanalyzing the Early Childhood Longitudinal Study.
Submitted 11 September, 2023; v1 submitted 12 August, 2019;
originally announced August 2019.
-
Detecting Heterogeneous Treatment Effect with Instrumental Variables
Authors:
Michael Johnson,
Jiongyi Cao,
Hyunseung Kang
Abstract:
There is an increasing interest in estimating heterogeneity in causal effects in randomized and observational studies. However, little research has been conducted to understand heterogeneity in an instrumental variables study. In this work, we present a method to estimate heterogeneous causal effects using an instrumental variable approach. The method has two parts. The first part uses subject-matter knowledge and interpretable machine learning techniques, such as classification and regression trees, to discover potential effect modifiers. The second part uses closed testing to test for the statistical significance of the effect modifiers while strongly controlling the familywise error rate. We applied this method to the Oregon Health Insurance Experiment, estimating the effect of Medicaid on the number of days an individual's health does not impede their usual activities, and found evidence of heterogeneity in older men who prefer English and don't self-identify as Asian and younger individuals who have at most a high school diploma or GED and prefer English.
Submitted 19 January, 2021; v1 submitted 9 August, 2019;
originally announced August 2019.
-
Increasing Power for Observational Studies of Aberrant Response: An Adaptive Approach
Authors:
Siyu Heng,
Hyunseung Kang,
Dylan S. Small,
Colin B. Fogarty
Abstract:
In many observational studies, the interest is in the effect of treatment on bad, aberrant outcomes rather than the average outcome. For such settings, the traditional approach is to define a dichotomous outcome indicating aberration from a continuous score and use the Mantel-Haenszel test with matched data. For example, studies of determinants of poor child growth use the World Health Organization's definition of child stunting being height-for-age z-score $\leq -2$. The traditional approach may lose power because it discards potentially useful information about the severity of aberration. We develop an adaptive approach that makes use of this information and asymptotically dominates the traditional approach. We develop our approach in two parts. First, we develop an aberrant rank approach in matched observational studies and prove a novel design sensitivity formula enabling its asymptotic comparison with the Mantel-Haenszel test under various settings. Second, we develop a new, general adaptive approach, the two-stage programming method, and use it to adaptively combine the aberrant rank test and the Mantel-Haenszel test. We apply our approach to a study of the effect of teenage pregnancy on stunting.
Submitted 14 October, 2020; v1 submitted 15 July, 2019;
originally announced July 2019.
-
Learning NP-Hard Multi-Agent Assignment Planning using GNN: Inference on a Random Graph and Provable Auction-Fitted Q-learning
Authors:
Hyunwook Kang,
Taehwan Kwon,
Jinkyoo Park,
James R. Morrison
Abstract:
This paper explores the possibility of near-optimally solving multi-agent, multi-task NP-hard planning problems with time-dependent rewards using a learning-based algorithm. In particular, we consider a class of robot/machine scheduling problems called the multi-robot reward collection problem (MRRC). Such MRRC problems well model ride-sharing, pickup-and-delivery, and a variety of related problems. In representing the MRRC problem as a sequential decision-making problem, we observe that each state can be represented as an extension of probabilistic graphical models (PGMs), which we refer to as random PGMs. We then develop a mean-field inference method for random PGMs. We then propose (1) an order-transferable Q-function estimator and (2) an order-transferability-enabled auction to select a joint assignment in polynomial time. These result in a reinforcement learning framework with at least $1-1/e$ optimality. Experimental results on solving MRRC problems highlight the near-optimality and transferability of the proposed methods. We also consider identical parallel machine scheduling problems (IPMS) and minimax multiple traveling salesman problems (minimax-mTSP).
Submitted 13 August, 2023; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Multi-Scale Fully Convolutional Network for Cardiac Left Ventricle Segmentation
Authors:
Han Kang,
Defeng Chen
Abstract:
The morphological structure of left ventricle segmented from cardiac magnetic resonance images can be used to calculate key clinical parameters, and it is of great significance to the accurate and efficient diagnosis of cardiovascular diseases. Compared with traditional methods, the segmentation algorithms based on fully convolutional neural network greatly improve the accuracy of semantic segmentation. For the problem of left ventricular segmentation, a new fully convolutional neural network structure named MS-FCN is proposed in this paper. The MS-FCN network employs a multi-scale pooling module to ensure that the network maximises the feature extraction ability and uses a dense connectivity decoder to refine the boundaries of the object. Based on the Sunnybrook cine-MR dataset provided by the MICCAI 2009 challenge, numerical experiments demonstrate that our proposed model has obtained state-of-the-art segmentation results: the Dice score of our method reaches 0.93 on the endocardium, and 0.96 on the epicardium.
Submitted 19 September, 2018;
originally announced September 2018.
-
Spillover Effects in Cluster Randomized Trials with Noncompliance
Authors:
Hyunseung Kang,
Luke Keele
Abstract:
Cluster randomized trials (CRTs) are popular in public health and in the social sciences to evaluate a new treatment or policy where the new policy is randomly allocated to clusters of units rather than individual units. CRTs often feature both noncompliance, when individuals within a cluster are not exposed to the intervention, and treatment spillovers, when individuals within a cluster influence each other so that those who comply with the new policy may affect the outcomes of those who do not. Here, we study the identification of causal effects in CRTs when both noncompliance and treatment spillovers are present. We prove that the standard analysis of CRT data with noncompliance using instrumental variables does not identify the usual complier average causal effect when treatment spillovers are present. We extend this result and show that no analysis of CRT data can unbiasedly estimate local network causal effects. Finally, we develop bounds for these causal effects under the assumption that the treatment is not harmful compared to the control. We demonstrate these results with an empirical study of a deworming intervention in Kenya.
Submitted 14 August, 2019; v1 submitted 20 August, 2018;
originally announced August 2018.
-
Estimation Methods for Cluster Randomized Trials with Noncompliance: A Study of A Biometric Smartcard Payment System in India
Authors:
Hyunseung Kang,
Luke Keele
Abstract:
Many policy evaluations occur in settings where treatment is randomized at the cluster level, and there is treatment noncompliance within each cluster. For example, villages might be assigned to treatment and control, but residents in each village may choose to comply or not with their assigned treatment status. When noncompliance is present, the instrumental variables framework can be used to identify and estimate causal effects. While a large literature exists on instrumental variables estimation methods, relatively little work has been focused on settings with clustered treatments. Here, we review extant methods for instrumental variable estimation in clustered designs and derive both the finite and asymptotic properties of these estimators. We prove that the properties of current estimators depend on unrealistic assumptions. We then develop a new IV estimation method for cluster randomized trials and study its formal properties. We prove that our IV estimator allows for possible treatment effect heterogeneity that is correlated with cluster size and is robust to low compliance rates within clusters. We evaluate these methods using simulations and apply them to data from a randomized intervention in India.
Submitted 14 August, 2019; v1 submitted 9 May, 2018;
originally announced May 2018.
-
Accurate and Efficient Estimation of Small P-values with the Cross-Entropy Method: Applications in Genomic Data Analysis
Authors:
Yang Shi,
Mengqiao Wang,
Weiping Shi,
Ji-Hyun Lee,
Huining Kang,
Hui Jiang
Abstract:
$\textbf{Motivation:}$ Small $p$-values are often required to be accurately estimated in large-scale genomic studies for the adjustment of multiple hypothesis tests and the ranking of genomic features based on their statistical significance. For those complicated test statistics whose cumulative distribution functions are analytically intractable, existing methods usually do not work well with small $p$-values due to lack of accuracy or computational restrictions. We propose a general approach for accurately and efficiently estimating small $p$-values for a broad range of complicated test statistics based on the principle of the cross-entropy method and Markov chain Monte Carlo sampling techniques. $\textbf{Results:}$ We evaluate the performance of the proposed algorithm through simulations and demonstrate its application to three real-world examples in genomic studies. The results show that our approach can accurately evaluate small to extremely small $p$-values (e.g. $10^{-6}$ to $10^{-100}$). The proposed algorithm is helpful for the improvement of some existing test procedures and the development of new test procedures in genomic studies.
Submitted 25 August, 2023; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Quantifying Gerrymandering in North Carolina
Authors:
Gregory Herschlag,
Han Sung Kang,
Justin Luo,
Christy Vaughn Graves,
Sachet Bangia,
Robert Ravier,
Jonathan C. Mattingly
Abstract:
Using an ensemble of redistricting plans, we evaluate whether a given political districting faithfully represents the geo-political landscape. Redistricting plans are sampled by a Monte Carlo algorithm from a probability distribution that adheres to realistic and non-partisan criteria. Using the sampled redistricting plans and historical voting data, we produce an ensemble of elections that reveal geo-political structure within the state. We showcase our methods on the two most recent districtings of NC for the U.S. House of Representatives, as well as a plan drawn by a bipartisan redistricting panel. We find the two state enacted plans are highly atypical outliers whereas the bipartisan plan accurately represents the ensemble both in partisan outcome and in the fine scale structure of district-level results.
Submitted 10 January, 2018;
originally announced January 2018.
-
Manifold Data Analysis with Applications to High-Frequency 3D Imaging
Authors:
Hyun Bin Kang,
Matthew Reimherr,
Mark Shriver,
Peter Claes
Abstract:
Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. This paper presents a new subfield of functional data analysis, FDA, which we call Manifold Data Analysis, or MDA. MDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units. We propose a framework that converts manifolds into functional objects, an efficient 2-step functional principal component method, and a manifold-on-scalar regression model. This work is motivated by an anthropological application involving 3D facial imaging data, which is discussed extensively throughout the paper. The proposed framework is used to understand how individual characteristics, such as age and genetic ancestry, influence the shape of the human face.
Submitted 4 October, 2017;
originally announced October 2017.
-
Markov Network for Modeling Local Item Dependence in Cognitively Diagnostic Classification Models
Authors:
Hyeon-Ah Kang,
Jingchen Liu,
Zhiliang Ying
Abstract:
The study presents an exploratory graphical modeling approach for evaluating local item dependency within cognitively diagnostic classification models (DCMs). Current approaches to modeling local dependence require known item structure and have limited utility when such information is not available. In this study, we propose an exploratory approach to modeling local dependence so that items' own interactions can be revealed without dependency specification. The new framework is developed by integrating a Markov network into a generalized DCM. The framework unveils item interactions while performing regular cognitive diagnosis within a unified scheme. The inference on the model parameters is made on the regularized pseudo-likelihood and is implemented by an EM algorithm. Numerical experimentation from Monte Carlo simulation suggests that the proposed framework adequately recovers generating parameters and yields reliable standard error estimates. Compared with the regular DCM, the new model produced more accurate item parameter estimates when items exhibited local dependence. The study demonstrates application of the model using two real assessment datasets and discusses practical benefits of modeling local dependence.
Submitted 26 May, 2023; v1 submitted 19 July, 2017;
originally announced July 2017.
-
Redistricting: Drawing the Line
Authors:
Sachet Bangia,
Christy Vaughn Graves,
Gregory Herschlag,
Han Sung Kang,
Justin Luo,
Jonathan C. Mattingly,
Robert Ravier
Abstract:
We develop methods to evaluate whether a political districting accurately represents the will of the people. To explore and showcase our ideas, we concentrate on the congressional districts for the U.S. House of Representatives and use the state of North Carolina and its redistrictings since the 2010 census. Using a Monte Carlo algorithm, we randomly generate over 24,000 redistrictings that are non-partisan and adhere to criteria from proposed legislation. Applying historical voting data to these random redistrictings, we find that the number of Democratic and Republican representatives elected varies drastically depending on how districts are drawn. Some results are more common, and we gain a clear range of expected election outcomes. Using the statistics of our generated redistrictings, we critique the particular congressional districtings used in the 2012 and 2016 NC elections as well as a districting proposed by a bipartisan redistricting commission. We find that the 2012 and 2016 districtings are highly atypical and not representative of the will of the people. On the other hand, our results indicate that a plan produced by a bipartisan panel of retired judges is highly typical and representative. Since our analyses are based on an ensemble of reasonable redistrictings of North Carolina, they provide a baseline for a given election which incorporates the geometry of the state's population distribution.
Submitted 8 May, 2017; v1 submitted 9 April, 2017;
originally announced April 2017.
-
Peer Encouragement Designs in Causal Inference with Partial Interference and Identification of Local Average Network Effects
Authors:
Hyunseung Kang,
Guido Imbens
Abstract:
In non-network settings, encouragement designs have been widely used to analyze causal effects of a treatment, policy, or intervention on an outcome of interest when randomizing the treatment is considered impractical or when compliance with treatment cannot be perfectly enforced. Unfortunately, such questions related to treatment compliance have received less attention in network settings, and the most well-studied experimental design in networks, the two-stage randomization design, requires perfect compliance with treatment. The paper proposes a new experimental design called peer encouragement design to study network treatment effects when enforcing treatment randomization is not feasible. The key idea in peer encouragement design is personalized encouragement, which allows point-identification of familiar estimands in the encouragement design literature. The paper also defines new causal estimands, local average network effects, that can be identified under the new design and analyzes the effect of non-compliance behavior in randomized experiments on networks.
Submitted 14 September, 2016;
originally announced September 2016.
-
Efficiently estimating small p-values in permutation tests using importance sampling and cross-entropy method
Authors:
Yang Shi,
Huining Kang,
Ji-Hyun Lee,
Hui Jiang
Abstract:
Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations are often needed to obtain reliable estimates of very small $p$-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small $p$-values in permutation tests for paired and independent two-group genomic data, and our approaches leverage a novel framework for parameterizing the permutation sample spaces of those two types of data respectively using the Bernoulli and conditional Bernoulli distributions, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated through the application to two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies and comparisons to existing methods such as crude permutations and SAMC, and the results show that our approaches can achieve orders of magnitude of computational efficiency gains in estimating small $p$-values. Our approaches offer promising solutions for the improvement of computational efficiencies of existing permutation test procedures and the development of new testing methods using permutations in genomic data analysis.
Submitted 25 August, 2023; v1 submitted 29 July, 2016;
originally announced August 2016.
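To make the computational burden mentioned in the abstract above concrete, the following is a minimal sketch of the crude permutation baseline only, not the authors' cross-entropy algorithm. The function name `crude_permutation_pvalue`, the absolute difference in group means as the test statistic, and the add-one correction are all assumptions chosen for illustration.

```python
import random


def crude_permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Crude Monte Carlo permutation p-value for two independent groups.

    Illustrative assumption: the test statistic is the absolute
    difference in group means.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)

    def abs_mean_diff(values):
        left, right = values[:n_x], values[n_x:]
        return abs(sum(left) / len(left) - sum(right) / len(right))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if abs_mean_diff(pooled) >= observed:
            hits += 1
    # Add-one correction: the estimate can never fall below 1 / (n_perm + 1).
    return (hits + 1) / (n_perm + 1)
```

Because the estimate is bounded below by `1 / (n_perm + 1)`, resolving a $p$-value near $10^{-6}$ with this baseline requires well over a million permutations, which is the cost the abstract describes as motivating more efficient estimators.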
-
Inference for Instrumental Variables: A Randomization Inference Approach
Authors:
Hyunseung Kang,
Laura Peck,
Luke Keele
Abstract:
The method of instrumental variables (IV) provides a framework to study causal effects in both randomized experiments with noncompliance and in observational studies where natural circumstances produce as-if random nudges to accept treatment. Traditionally, inference for IV relied on asymptotic approximations of the distribution of the Wald estimator or two-stage least squares, often with structural modeling assumptions and/or moment conditions. In this paper, we utilize the randomization inference approach to IV inference. First, we outline the exact method, which uses the randomized assignment of treatment in experiments as a basis for inference, but lacks a closed-form solution and may be computationally infeasible in many applications. We then provide an alternative to the exact method, the almost exact method, which is computationally feasible but retains the advantages of the exact method. We also review asymptotic methods of inference, including those associated with two-stage least squares, and analytically compare them to randomization inference methods. We also perform additional comparisons using a set of simulations. We conclude with three different applications from the social sciences.
Submitted 6 February, 2018; v1 submitted 13 June, 2016;
originally announced June 2016.
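For readers unfamiliar with the Wald estimator named in the abstract above, here is a minimal sketch for the textbook case of a binary instrument. It shows only the point estimate, not the exact or almost exact randomization inference procedures the paper develops; the function name `wald_estimate` and the argument names `z` (instrument), `d` (treatment received), and `y` (outcome) are hypothetical.

```python
def wald_estimate(z, d, y):
    """Wald (ratio) estimate of a causal effect with a binary instrument.

    Computes the difference in mean outcome between the two instrument
    arms divided by the difference in mean treatment uptake.
    """
    def mean(values):
        return sum(values) / len(values)

    y1 = [yi for zi, yi in zip(z, y) if zi == 1]
    y0 = [yi for zi, yi in zip(z, y) if zi == 0]
    d1 = [di for zi, di in zip(z, d) if zi == 1]
    d0 = [di for zi, di in zip(z, d) if zi == 0]

    uptake_diff = mean(d1) - mean(d0)
    if uptake_diff == 0:
        raise ValueError("instrument has no effect on treatment uptake")
    return (mean(y1) - mean(y0)) / uptake_diff
```

When the denominator is close to zero (a weak instrument), the ratio becomes unstable, which is one reason asymptotic approximations to its distribution can be unreliable.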
-
Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index
Authors:
Susan Athey,
Raj Chetty,
Guido Imbens,
Hyunseung Kang
Abstract:
Estimating the long-term effects of treatments is of interest in many fields. A common challenge in estimating such treatment effects is that long-term outcomes are unobserved in the time frame needed to make policy decisions. One approach to overcome this missing data problem is to analyze treatment effects on an intermediate outcome, often called a statistical surrogate, if it satisfies the condition that treatment and outcome are independent conditional on the statistical surrogate. The validity of the surrogacy condition is often controversial. Here we exploit the fact that in modern datasets, researchers often observe a large number, possibly hundreds or thousands, of intermediate outcomes, thought to lie on or close to the causal chain between the treatment and the long-term outcome of interest. Even if none of the individual proxies satisfies the statistical surrogacy criterion by itself, using multiple proxies can be useful in causal inference. We focus primarily on a setting with two samples, an experimental sample containing data about the treatment indicator and the surrogates and an observational sample containing information about the surrogates and the primary outcome. We state assumptions under which the average treatment effect can be identified and estimated with a high-dimensional vector of proxies that collectively satisfy the surrogacy assumption, derive the bias from violations of the surrogacy assumption, and show that even if the primary outcome is also observed in the experimental sample, there is still information to be gained from using surrogates.
Submitted 21 August, 2024; v1 submitted 30 March, 2016;
originally announced March 2016.
-
Confidence Intervals for Causal Effects with Invalid Instruments using Two-Stage Hard Thresholding with Voting
Authors:
Zijian Guo,
Hyunseung Kang,
T. Tony Cai,
Dylan S. Small
Abstract:
A major challenge in instrumental variables (IV) analysis is to find instruments that are valid, or have no direct effect on the outcome and are ignorable. Typically one is unsure whether all of the putative IVs are in fact valid. We propose a general inference procedure in the presence of invalid IVs, called Two-Stage Hard Thresholding (TSHT) with voting. TSHT uses two hard thresholding steps to select strong instruments and generate candidate sets of valid IVs. Voting takes the candidate sets and uses majority and plurality rules to determine the true set of valid IVs. In low dimensions, if the sufficient and necessary identification condition under invalid instruments is met, which is more general than the so-called 50% rule or the majority rule, our proposal (i) correctly selects valid IVs, (ii) consistently estimates the causal effect, (iii) produces valid confidence intervals for the causal effect, and (iv) has oracle-optimal width. In high dimensions, we establish nearly identical results without oracle-optimality. In simulations, our proposal outperforms traditional and recent methods in the invalid IV literature. We also apply our method to re-analyze the causal effect of education on earnings.
Submitted 8 August, 2017; v1 submitted 16 March, 2016;
originally announced March 2016.
-
A simple and robust confidence interval for causal effects with possibly invalid instruments
Authors:
Hyunseung Kang,
T. Tony Cai,
Dylan S. Small
Abstract:
Instrumental variables have been widely used to estimate the causal effect of a treatment on an outcome. Existing confidence intervals for causal effects based on instrumental variables assume that all of the putative instrumental variables are valid; a valid instrumental variable is a variable that affects the outcome only by affecting the treatment and is not related to unmeasured confounders. However, in practice, some of the putative instrumental variables are likely to be invalid. This paper presents a simple and general approach to construct a confidence interval that is robust to possibly invalid instruments. The robust confidence interval has theoretical guarantees on having the correct coverage and can also be used to assess the sensitivity of inference when instrumental variables assumptions are violated. The paper also shows that the robust confidence interval outperforms traditional confidence intervals popular in instrumental variables literature when invalid instruments are present. The new approach is applied to a developmental economics study of the causal effect of income on food expenditures.
Submitted 12 July, 2016; v1 submitted 14 April, 2015;
originally announced April 2015.
-
Full Matching Approach to Instrumental Variables Estimation with Application to the Effect of Malaria on Stunting
Authors:
Hyunseung Kang,
Benno Kreuels,
Jürgen May,
Dylan S. Small
Abstract:
Most previous studies of the causal relationship between malaria and stunting have been studies where potential confounders are controlled via regression-based methods, but these studies may have been biased by unobserved confounders. Instrumental variables (IV) regression offers a way to control for unmeasured confounders where, in our case, the sickle cell trait can be used as an instrument. However, for the instrument to be valid, it may still be important to account for measured confounders. The most commonly used instrumental variable regression method, two-stage least squares, relies on parametric assumptions on the effects of measured confounders to account for them. Additionally, two-stage least squares lacks transparency with respect to covariate balance and weighting of subjects and does not blind the researcher to the outcome data. To address these drawbacks, we propose an alternative method for IV estimation based on full matching. We evaluate our new procedure on simulated data and real data concerning the causal effect of malaria on stunting among children. We estimate that the risk of stunting among children with the sickle cell trait decreases by 0.22 times the average number of malaria episodes prevented by the sickle cell trait, a substantial effect of malaria on stunting (p-value: 0.011, 95% CI: 0.044, 1).
Submitted 10 November, 2015; v1 submitted 26 November, 2014;
originally announced November 2014.
-
Instrumental Variables Estimation with Some Invalid Instruments and its Application to Mendelian Randomization
Authors:
Hyunseung Kang,
Anru Zhang,
T. Tony Cai,
Dylan S. Small
Abstract:
Instrumental variables have been widely used for estimating the causal effect between exposure and outcome. Conventional estimation methods require complete knowledge about all the instruments' validity; a valid instrument must not have a direct effect on the outcome and not be related to unmeasured confounders. Often, this is impractical as highlighted by Mendelian randomization studies where genetic markers are used as instruments and complete knowledge about instruments' validity is equivalent to complete knowledge about the involved genes' functions.
In this paper, we propose a method for estimation of causal effects when this complete knowledge is absent. It is shown that causal effects are identified and can be estimated as long as fewer than 50% of instruments are invalid, without knowing which of the instruments are invalid. We also introduce conditions for identification when the 50% threshold is violated. A fast penalized $\ell_1$ estimation method, called sisVIVE, is introduced for estimating the causal effect without knowing which instruments are valid, with theoretical guarantees on its performance. The proposed method is demonstrated on simulated data and a real Mendelian randomization study concerning the effect of body mass index on health-related quality of life index. An R package \emph{sisVIVE} is available online.
Submitted 21 September, 2014; v1 submitted 22 January, 2014;
originally announced January 2014.
-
K-Adaptive Partitioning for Survival Data, with an Application to Cancer Staging
Authors:
Soo-Heang Eo,
Hyo Jeong Kang,
Seung-Mo Hong,
HyungJun Cho
Abstract:
In medical research, it is often necessary to obtain subgroups with heterogeneous survival, as predicted from a prognostic factor. For this purpose, a binary split has often been used once or recursively; however, binary partitioning may not provide an optimal set of well-separated subgroups. We propose a multi-way partitioning algorithm, which divides the data into K heterogeneous subgroups based on the information from a prognostic factor. The resulting subgroups show significant differences in survival. Such a multi-way partition is found by maximizing the minimum of the subgroup pairwise test statistics. An optimal number of subgroups is determined by a permutation test. Our algorithm is compared with two binary recursive partitioning algorithms. In addition, its usefulness is demonstrated with a real dataset of colorectal cancer cases from the Surveillance Epidemiology and End Results program. We have implemented our algorithm in an R package, kaps, which is freely available from the Comprehensive R Archive Network (CRAN).
Submitted 1 November, 2014; v1 submitted 19 June, 2013;
originally announced June 2013.