-
Clustering and Pruning in Causal Data Fusion
Authors:
Otto Tabell,
Santtu Tikka,
Juha Karvanen
Abstract:
Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.
Submitted 21 May, 2025;
originally announced May 2025.
-
Monotone Missing Data: A Blessing and a Curse
Authors:
Santtu Tikka,
Juha Karvanen
Abstract:
Monotone missingness is commonly encountered in practice where a missing measurement compels another measurement to be missing. In graphical missing data models, monotonicity has implications for the identifiability of the full law, i.e., the joint distribution of actual variables and response indicators. In the general nonmonotone case, the full law is known to be nonparametrically identifiable if and only if neither colluders nor self-censoring edges are present in the graph. We show that monotonicity may enable the identification of the full law despite colluders and prevent the identification under mediated (pathwise) self-censoring. The results emphasize the importance of proper treatment of monotone missingness in the analysis of incomplete data.
Submitted 6 November, 2024;
originally announced November 2024.
-
Dynamic programming principle in cost-efficient sequential design: application to switching measurements
Authors:
Jeongmin Han,
Juha Karvanen,
Mikko Parviainen
Abstract:
We study sequential cost-efficient design in a situation where each update of covariates involves a fixed time cost typically considerable compared to a single measurement time. The problem arises from parameter estimation in switching measurements on superconducting Josephson junctions which are components needed in quantum computers and other superconducting electronics. In switching measurements, a sequence of current pulses is applied to the junction and a binary voltage response is observed. The measurement requires a very low temperature that can be kept stable only for a relatively short time, and therefore it is essential to use an efficient design. We use the dynamic programming principle from the mathematical theory of optimal control to solve the optimal update times. Our simulations demonstrate the cost-efficiency compared to the previously used methods.
Submitted 4 March, 2024;
originally announced March 2024.
-
Full Law Identification under Missing Data with Categorical Variables
Authors:
Santtu Tikka,
Juha Karvanen
Abstract:
Missing data may be disastrous for the identifiability of causal and statistical estimands. In graphical missing data models, colluders are dependence structures that have a special importance for identification considerations. It has been shown that the presence of a colluder makes the full law, i.e., the joint distribution of variables and response indicators, non-parametrically non-identifiable. However, when the variables related to the colluder structure are categorical, it is sometimes possible to regain the identifiability of the full law. We present a necessary and sufficient condition for the identification of the full law in the presence of colluder structures with arbitrary categorical variables. Maximum likelihood estimation of the full law in identifiable models with categorical variables is demonstrated with simulated and real data.
Submitted 3 July, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Simulating counterfactuals
Authors:
Juha Karvanen,
Santtu Tikka,
Matti Vihola
Abstract:
Counterfactual inference considers a hypothetical intervention in a parallel world that shares some evidence with the factual world. If the evidence specifies a conditional distribution on a manifold, counterfactuals may be analytically intractable. We present an algorithm for simulating values from a counterfactual distribution where conditions can be set on both discrete and continuous variables. We show that the proposed algorithm can be presented as a particle filter leading to asymptotically valid inference. The algorithm is applied to fairness analysis in credit-scoring.
Submitted 26 March, 2024; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Price Optimization Combining Conjoint Data and Purchase History: A Causal Modeling Approach
Authors:
Lauri Valkonen,
Santtu Tikka,
Jouni Helske,
Juha Karvanen
Abstract:
Pricing decisions of companies require an understanding of the causal effect of a price change on the demand. When real-life pricing experiments are infeasible, data-driven decision-making must be based on alternative data sources such as purchase history (sales data) and conjoint studies where a group of customers is asked to make imaginary purchases in an artificial setup. We present an approach for price optimization that combines population statistics, purchase history and conjoint data in a systematic way. We build on the recent advances in causal inference to identify and quantify the effect of price on the purchase probability at the customer level. The identification task is a transportability problem whose solution requires a parametric assumption on the differences between the conjoint study and real purchases. The causal effect is estimated using Bayesian methods that take into account the uncertainty of the data sources. The pricing decision is made by comparing the estimated posterior distributions of gross profit for different prices. The approach is demonstrated with simulated data resembling the features of real-world data.
Submitted 30 April, 2024; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Generalizing experimental findings: identification beyond adjustments
Authors:
Juha Karvanen
Abstract:
We aim to generalize the results of a randomized controlled trial (RCT) to a target population with the help of some observational data. This is a problem of causal effect identification with multiple data sources. Challenges arise when the RCT is conducted in a context that differs from the target population. Earlier research has focused on cases where the estimates from the RCT can be adjusted by observational data in order to remove the selection bias and other domain specific differences. We consider examples where the experimental findings cannot be generalized by an adjustment and show that the generalization may still be possible by other identification strategies that can be derived by applying do-calculus. The obtained identifying functionals for these examples contain trapdoor variables of a new type. The value of a trapdoor variable needs to be fixed in the estimation and the choice of the value may have a major effect on the bias and accuracy of estimates, which is also seen in simulations. The presented results expand the scope of settings where the generalization of experimental findings is doable.
Submitted 14 June, 2022;
originally announced June 2022.
-
Contrasting Identifying Assumptions of Average Causal Effects: Robustness and Semiparametric Efficiency
Authors:
Tetiana Gorbach,
Xavier de Luna,
Juha Karvanen,
Ingeborg Waernbaum
Abstract:
Semiparametric inference on average causal effects from observational data is based on assumptions yielding identification of the effects. In practice, several distinct identifying assumptions may be plausible; an analyst has to make a delicate choice between these models. In this paper, we study three identifying assumptions based on the potential outcome framework: the back-door assumption, which uses pre-treatment covariates, the front-door assumption, which uses mediators, and the two-door assumption using pre-treatment covariates and mediators simultaneously. We provide the efficient influence functions and the corresponding semiparametric efficiency bounds that hold under these assumptions, and their combinations. We demonstrate that neither of the identification models provides uniformly the most efficient estimation and give conditions under which some bounds are lower than others. We show when semiparametric estimating equation estimators based on influence functions attain the bounds, and study the robustness of the estimators to misspecification of the nuisance models. The theory is complemented with simulation experiments on the finite sample behavior of the estimators. The results obtained are relevant for an analyst facing a choice between several plausible identifying assumptions and corresponding estimators. Our results show that this choice implies a trade-off between efficiency and robustness to misspecification of the nuisance models.
Submitted 17 February, 2023; v1 submitted 30 November, 2021;
originally announced November 2021.
-
Clustering and Structural Robustness in Causal Diagrams
Authors:
Santtu Tikka,
Jouni Helske,
Juha Karvanen
Abstract:
Graphs are commonly used to represent and visualize causal relations. For a small number of variables, this approach provides a succinct and clear view of the scenario at hand. As the number of variables under study increases, the graphical approach may become impractical, and the clarity of the representation is lost. Clustering of variables is a natural way to reduce the size of the causal diagram, but it may erroneously change the essential properties of the causal relations if implemented arbitrarily. We define a specific type of cluster, called transit cluster, that is guaranteed to preserve the identifiability properties of causal effects under certain conditions. We provide a sound and complete algorithm for finding all transit clusters in a given graph and demonstrate how clustering can simplify the identification of causal effects. We also study the inverse problem, where one starts with a clustered graph and looks for extended graphs where the identifiability properties of causal effects remain unchanged. We show that this kind of structural robustness is closely related to transit clusters.
Submitted 15 August, 2023; v1 submitted 8 November, 2021;
originally announced November 2021.
-
Simulation Framework for Realistic Large-scale Individual-level Data Generation with an Application in the Health Domain
Authors:
Santtu Tikka,
Jussi Hakanen,
Mirka Saarela,
Juha Karvanen
Abstract:
We propose a framework for realistic data generation and simulation of complex systems and demonstrate its capabilities in the health domain. The main use cases of the framework are predicting the development of risk factors and disease occurrence, evaluating the impact of interventions and policy decisions, and statistical method development. We present the fundamentals of the framework using rigorous mathematical definitions. The framework supports calibration to a real population as well as various manipulations and data collection processes. The freely available open-source implementation in R embraces efficient data structures, parallel computing and fast random number generation which ensure reproducibility and scalability. With the framework it is possible to run daily-level simulations for populations of millions of individuals for decades of simulated time. An example on the occurrence of stroke, type 2 diabetes and mortality illustrates the usage of the framework in the Finnish context. In the example, we demonstrate the data-collection functionality by studying the impact of non-participation on the estimated risk models and interventions related to controlling the additional salt intake.
Submitted 5 June, 2021; v1 submitted 31 August, 2020;
originally announced August 2020.
-
Do-search -- a tool for causal inference and study design with multiple data sources
Authors:
Juha Karvanen,
Santtu Tikka,
Antti Hyttinen
Abstract:
Epidemiological evidence is based on multiple data sources including clinical trials, cohort studies, surveys, registries and expert opinions. Merging information from different sources opens up new possibilities for the estimation of causal effects. We show how causal effects can be identified and estimated by combining experiments and observations in real and realistic scenarios. As a new tool, we present do-search, a recently developed algorithmic approach that can determine the identifiability of a causal effect. The approach is based on do-calculus, and it can utilize data with non-trivial missing data and selection bias mechanisms. When the effect is identifiable, do-search outputs an identifying formula on which numerical estimation can be based. When the effect is not identifiable, we can use do-search to recognize additional data sources and assumptions that would make the effect identifiable. Throughout the paper, we consider the effect of salt-adding behavior on blood pressure mediated by the salt intake as an example. The identifiability of this effect is resolved in various scenarios with different assumptions on confounding. There are scenarios where the causal effect is identifiable from a chain of experiments but not from survey data, as well as scenarios where the opposite is true. As an illustration, we use survey data from NHANES 2013--2016 and the results from a meta-analysis of randomized controlled trials and estimate the reduction in average systolic blood pressure under an intervention where the use of table salt is discontinued.
Submitted 16 July, 2020;
originally announced July 2020.
-
Estimation of causal effects with small data in the presence of trapdoor variables
Authors:
Jouni Helske,
Santtu Tikka,
Juha Karvanen
Abstract:
We consider the problem of estimating causal effects of interventions from observational data when well-known back-door and front-door adjustments are not applicable. We show that when an identifiable causal effect is subject to an implicit functional constraint that is not deducible from conditional independence relations, the estimator of the causal effect can exhibit bias in small samples. This bias is related to variables that we call trapdoor variables. We use simulated data to study different strategies to account for trapdoor variables and suggest how the related trapdoor bias might be minimized. The importance of trapdoor variables in causal effect estimation is illustrated with real data from the Life Course 1971-2002 study. Using this dataset, we estimate the causal effect of education on income in the Finnish context. Bayesian modelling allows us to take the parameter uncertainty into account and to present the estimated causal effects as posterior distributions.
Submitted 24 March, 2021; v1 submitted 6 March, 2020;
originally announced March 2020.
-
Causal Effect Identification from Multiple Incomplete Data Sources: A General Search-based Approach
Authors:
Santtu Tikka,
Antti Hyttinen,
Juha Karvanen
Abstract:
Causal effect identification considers whether an interventional probability distribution can be uniquely determined without parametric assumptions from measured source distributions and structural knowledge on the generating system. While complete graphical criteria and procedures exist for many identification problems, there are still challenging but important extensions that have not been considered in the literature. To tackle these new settings, we present a search algorithm directly over the rules of do-calculus. Due to generality of do-calculus, the search is capable of taking more advanced data-generating mechanisms into account along with an arbitrary type of both observational and experimental source distributions. The search is enhanced via a heuristic and search space reduction techniques. The approach, called do-search, is provably sound, and it is complete with respect to identifiability problems that have been shown to be completely characterized by do-calculus. When extended with additional rules, the search is capable of handling missing data problems as well. With the versatile search, we are able to approach new problems such as combined transportability and selection bias, or multiple sources of selection bias. We perform a systematic analysis of bivariate missing data problems and study causal inference under case-control design. We also present the R package dosearch that provides an interface for a C++ implementation of the search.
Submitted 27 August, 2021; v1 submitted 4 February, 2019;
originally announced February 2019.
-
Surrogate Outcomes and Transportability
Authors:
Santtu Tikka,
Juha Karvanen
Abstract:
Identification of causal effects is one of the most fundamental tasks of causal inference. We consider an identifiability problem where some experimental and observational data are available but neither data alone is sufficient for the identification of the causal effect of interest. Instead of the outcome of interest, surrogate outcomes are measured in the experiments. This problem is a generalization of identifiability using surrogate experiments and we label it as surrogate outcome identifiability. We show that the concept of transportability provides a sufficient criterion for determining surrogate outcome identifiability for a large class of queries.
Submitted 12 March, 2019; v1 submitted 19 June, 2018;
originally announced June 2018.
-
Identifying Causal Effects with the R Package causaleffect
Authors:
Santtu Tikka,
Juha Karvanen
Abstract:
Do-calculus is concerned with estimating the interventional distribution of an action from the observed joint probability distribution of the variables in a given causal structure. All identifiable causal effects can be derived using the rules of do-calculus, but the rules themselves do not give any direct indication whether the effect in question is identifiable or not. Shpitser and Pearl constructed an algorithm for identifying joint interventional distributions in causal models, which contain unobserved variables and induce directed acyclic graphs. This algorithm can be seen as a repeated application of the rules of do-calculus and known properties of probabilities, and it ultimately either derives an expression for the causal distribution, or fails to identify the effect, in which case the effect is non-identifiable. In this paper, the R package causaleffect is presented, which provides an implementation of this algorithm. Functionality of causaleffect is also demonstrated through examples.
Submitted 19 June, 2018;
originally announced June 2018.
-
Enhancing Identification of Causal Effects by Pruning
Authors:
Santtu Tikka,
Juha Karvanen
Abstract:
Causal models communicate our assumptions about causes and effects in real-world phenomena. Often the interest lies in the identification of the effect of an action which means deriving an expression from the observed probability distribution for the interventional distribution resulting from the action. In many cases an identifiability algorithm may return a complicated expression that contains variables that are in fact unnecessary. In practice this can lead to additional computational burden and increased bias or inefficiency of estimates when dealing with measurement error or missing data. We present graphical criteria to detect variables which are redundant in identifying causal effects. We also provide an improved version of a well-known identifiability algorithm that implements these criteria.
Submitted 19 June, 2018;
originally announced June 2018.
-
Simplifying Probabilistic Expressions in Causal Inference
Authors:
Santtu Tikka,
Juha Karvanen
Abstract:
Obtaining a non-parametric expression for an interventional distribution is one of the most fundamental tasks in causal inference. Such an expression can be obtained for an identifiable causal effect by an algorithm or by manual application of do-calculus. Often we are left with a complicated expression which can lead to biased or inefficient estimates when missing data or measurement errors are involved. We present an automatic simplification algorithm that seeks to eliminate symbolically unnecessary variables from these expressions by taking advantage of the structure of the underlying graphical model. Our method is applicable to all causal effect formulas and is readily available in the R package causaleffect.
Submitted 19 June, 2018;
originally announced June 2018.
-
Adjusting for selective non-participation with re-contact data in the FINRISK 2012 survey
Authors:
Juho Kopra,
Tommi Härkänen,
Hanna Tolonen,
Pekka Jousilahti,
Kari Kuulasmaa,
Jaakko Reinikainen,
Juha Karvanen
Abstract:
Aims: A common objective of epidemiological surveys is to provide population-level estimates of health indicators. Survey results tend to be biased under selective non-participation. One approach to bias reduction is to collect information about non-participants by contacting them again and asking them to fill in a questionnaire. This information is called re-contact data, and it allows the estimates to be adjusted for non-participation.
Methods: We analyse data from the FINRISK 2012 survey, where re-contact data were collected. We assume that the respondents of the re-contact survey are similar to the remaining non-participants with respect to the health given their available background information. Validity of this assumption is evaluated based on the hospitalization data obtained through record linkage of survey data to the administrative registers. Using this assumption and multiple imputation, we estimate the prevalences of daily smoking and heavy alcohol consumption and compare them to estimates obtained with a commonly used assumption that the participants represent the entire target group.
Results: This approach produces higher prevalence estimates than what is estimated from participants only. Among men, smoking prevalence estimate was 28.5% (23.2% for participants), heavy alcohol consumption prevalence was 9.4% (6.8% for participants). Among women, smoking prevalence was 19.0% (16.5% for participants) and heavy alcohol consumption 4.8% (3.0% for participants).
Conclusion: Utilization of re-contact data is a useful method to adjust for non-participation bias on population estimates in epidemiological surveys.
Submitted 16 November, 2017;
originally announced November 2017.
-
Bayesian models for data missing not at random in health examination surveys
Authors:
Juho Kopra,
Juha Karvanen,
Tommi Härkänen
Abstract:
In epidemiological surveys, data missing not at random (MNAR) due to survey nonresponse may potentially lead to a bias in the risk factor estimates. We propose an approach based on Bayesian data augmentation and survival modelling to reduce the nonresponse bias. The approach requires additional information based on follow-up data. We present a case study of smoking prevalence using FINRISK data collected between 1972 and 2007 with a follow-up to the end of 2012 and compare it to other commonly applied missing at random (MAR) imputation approaches. A simulation experiment is carried out to study the validity of the approaches. Our approach appears to reduce the nonresponse bias substantially, whereas MAR imputation was not successful in bias reduction.
Submitted 28 August, 2017; v1 submitted 12 October, 2016;
originally announced October 2016.
-
Optimal design of observational studies: overview and synthesis
Authors:
Juha Karvanen,
Jarno Vanhatalo,
Kari Auranen,
Sangita Kulathinal,
Samu Mäntyniemi
Abstract:
We review typical design problems encountered in the planning of observational studies and propose a unifying framework that allows us to use the same concepts and notation for different problems. In the framework, the design is defined as a probability measure in the space of observational processes that determine whether the value of a variable is observed for a specific unit at the given time. The optimal design is then defined, according to Bayesian decision theory, to be the one that maximizes the expected utility related to the design. We present examples of the use of the framework and discuss methods for deriving optimal or approximately optimal designs.
Submitted 1 November, 2017; v1 submitted 27 September, 2016;
originally announced September 2016.
-
Bayesian subcohort selection for longitudinal covariate measurements in follow-up studies
Authors:
Jaakko Reinikainen,
Juha Karvanen
Abstract:
We consider planning longitudinal covariate measurements in follow-up studies where covariates are time-varying. We assume that the entire cohort cannot be selected for longitudinal measurements due to financial limitations and study how a subset of the cohort should be selected optimally in order to obtain precise estimates of covariate effects in a survival model. In our approach, the study will be designed sequentially utilizing the data collected in previous measurements of the individuals as prior information. We propose using a Bayesian optimality criterion in the subcohort selections, which is compared with simple random sampling using simulated and real follow-up data. This study extends previous results where optimal subcohort selection was studied with only one re-measurement and one covariate, to more realistic cases where several covariates and measurement points are allowed. Our results support the conclusion that the precision of the estimates can be clearly improved by optimal design.
Submitted 6 September, 2016;
originally announced September 2016.
-
Prioritizing covariates in the planning of future studies in the meta-analytic framework
Authors:
Juha Karvanen,
Mikko J. Sillanpää
Abstract:
Science can be seen as a sequential process where each new study augments evidence to the existing knowledge. To have the best prospects to make an impact in this process, a new study should be designed optimally taking into account the previous studies and other prior information. We propose a formal approach for the covariate prioritization, i.e., the decision about the covariates to be measured in a new study. The decision criteria can be based on conditional power, change of the p-value, change in lower confidence limit, Kullback-Leibler divergence, Bayes factors, Bayesian false discovery rate or difference between prior and posterior expectation. The criteria can also be used for decisions on the sample size. As an illustration, we consider covariate prioritization based on genome-wide association studies for C-reactive protein levels and make suggestions on the genes to be studied further.
keywords: design; evidence-based medicine; meta-analysis; power; scientific method
Submitted 8 August, 2016;
originally announced August 2016.
-
Correcting for non-ignorable missingness in smoking trends
Authors:
Juho Kopra,
Tommi Härkänen,
Hanna Tolonen,
Juha Karvanen
Abstract:
Data missing not at random (MNAR) is a major challenge in survey sampling. We propose an approach based on registry data to deal with non-ignorable missingness in health examination surveys. The approach relies on follow-up data available from administrative registers several years after the survey. For illustration we use data on smoking prevalence in the Finnish National FINRISK study conducted in 1972-1997. The data consist of measured survey information including missingness indicators, register-based background information and register-based time-to-disease survival data. The parameters of the missingness mechanism are estimable with these data although the original survey data are MNAR. The underlying data generation process is modelled by a Bayesian model. The results indicate that the estimated smoking prevalence rates in Finland may be significantly affected by missing data.
Submitted 12 February, 2015;
originally announced February 2015.
-
Estimating complex causal effects from incomplete observational data
Authors:
Juha Karvanen
Abstract:
Despite the major advances made in causal modeling, causality is still an unfamiliar topic for many statisticians. In this paper, it is demonstrated from beginning to end how causal effects can be estimated from observational data assuming that the causal structure is known. To make the problem more challenging, the causal effects are highly nonlinear and the data are missing at random. The tools used in the estimation include causal models with design, causal calculus, multiple imputation and generalized additive models. The main message is that a trained statistician can estimate causal effects by judiciously combining existing tools.
Submitted 2 July, 2014; v1 submitted 5 March, 2014;
originally announced March 2014.
-
Survey data and Bayesian analysis: a cost-efficient way to estimate customer equity
Authors:
Juha Karvanen,
Ari Rantanen,
Lasse Luoma
Abstract:
We present a Bayesian framework for estimating the customer lifetime value (CLV) and the customer equity (CE) based on the purchasing behavior deducible from market surveys on customer purchasing behavior. The proposed framework systematically addresses the challenges faced when the future value of customers is estimated based on survey data. The scarcity of the survey data and the sampling variance are countered by utilizing the prior information and quantifying the uncertainty of the CE and CLV estimates by posterior distributions. Furthermore, information on the purchase behavior of the customers of competitors available in the survey data is integrated into the framework. The introduced approach is directly applicable in the domains where a customer relationship can be thought to be monogamous.
As an example of the use of the framework, we analyze a consumer survey on mobile phones carried out in Finland in February 2013. The survey data contain consumer-given information on the current and previous brand of the phone and the times of the last two purchases.
Submitted 30 May, 2014; v1 submitted 19 April, 2013;
originally announced April 2013.
-
Study design in causal models
Authors:
Juha Karvanen
Abstract:
The causal assumptions, the study design and the data are the elements required for scientific inference in empirical research. The research is adequately communicated only if all of these elements and their relations are described precisely. Causal models with design describe the study design and the missing data mechanism together with the causal structure and allow the direct application of causal calculus in the estimation of the causal effects. The flow of the study is visualized by ordering the nodes of the causal diagram in two dimensions by their causal order and the time of the observation. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph. Causal models with design offer a systematic and unifying view of scientific inference and increase the clarity and speed of communication. Examples of causal models for a case-control study, a nested case-control study, a clinical trial and a two-stage case-cohort study are presented.
Submitted 24 April, 2014; v1 submitted 13 November, 2012;
originally announced November 2012.
-
Characterizing the generalized lambda distribution by L-moments
Authors:
Juha Karvanen,
Arto Nuutinen
Abstract:
The generalized lambda distribution (GLD) is a flexible four-parameter distribution with many practical applications. L-moments of the GLD can be expressed in closed form and are good alternatives for the central moments. The L-moments of the GLD up to an arbitrary order are presented, and a study of L-skewness and L-kurtosis that can be achieved by the GLD is provided. The boundaries of L-skewness and L-kurtosis are derived analytically for the symmetric GLD and calculated numerically for the GLD in general. Additionally, the contours of L-skewness and L-kurtosis are presented as functions of the GLD parameters. It is found that with the exception of the smallest values of L-kurtosis, the GLD covers all possible pairs of L-skewness and L-kurtosis and often there are two or more distributions that share the same L-skewness and the same L-kurtosis. Examples that demonstrate situations where there are four GLD members with the same L-skewness and the same L-kurtosis are presented. The estimation of the GLD parameters is studied in a simulation example where the method of L-moments compares favorably to more complicated estimation methods. The results increase the knowledge of the distributions that belong to the GLD family and can be utilized in model selection and estimation.
Submitted 26 June, 2007; v1 submitted 15 January, 2007;
originally announced January 2007.
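The L-moments described in the abstract above can be checked numerically. The sketch below assumes the Ramberg-Schmeiser parameterization of the GLD quantile function, Q(u) = λ1 + (u^λ3 − (1 − u)^λ4)/λ2, which the abstract does not state, and integrates Q(u) against shifted Legendre polynomials with a midpoint rule instead of using the closed-form expressions. All function names are illustrative and are not taken from the paper.

```python
def gld_quantile(u, lam1, lam2, lam3, lam4):
    """GLD quantile function (assumed Ramberg-Schmeiser parameterization)."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def shifted_legendre(k, u):
    """Shifted Legendre polynomial P*_k(u) = P_k(2u - 1) for k = 0..3."""
    x = 2.0 * u - 1.0
    if k == 0:
        return 1.0
    if k == 1:
        return x
    if k == 2:
        return (3.0 * x * x - 1.0) / 2.0
    if k == 3:
        return (5.0 * x ** 3 - 3.0 * x) / 2.0
    raise ValueError("only k = 0..3 are implemented in this sketch")


def gld_l_moment(r, lam, n=20000):
    """L-moment of order r (1..4): integral of Q(u) * P*_{r-1}(u) over (0, 1).

    Uses a plain midpoint rule, which never evaluates Q at u = 0 or u = 1.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += gld_quantile(u, *lam) * shifted_legendre(r - 1, u)
    return total * h


def l_skewness_kurtosis(lam):
    """Return (L-skewness, L-kurtosis) = (L3 / L2, L4 / L2)."""
    l2 = gld_l_moment(2, lam)
    return gld_l_moment(3, lam) / l2, gld_l_moment(4, lam) / l2
```

With `lam = (0.0, 1.0, 1.0, 1.0)` the assumed quantile function reduces to that of a uniform distribution on (-1, 1), so the second L-moment should be close to 1/3 and both ratios close to zero. The midpoint rule converges slowly for heavy-tailed members (negative λ3 or λ4), where a closed-form expression would be preferable.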
-
Efficient initial designs for binary response data
Authors:
Juha Karvanen
Abstract:
In this paper we introduce a binary search algorithm that efficiently finds initial maximum likelihood estimates for sequential experiments where a binary response is modeled by a continuous factor. The problem is motivated by switching measurements on superconducting Josephson junctions. In this quantum mechanical experiment, the current is the factor controlled by the experimenter and a binary response indicating the presence or the absence of a voltage response is measured. The prior knowledge on the model parameters is typically poor, which may cause the common approaches of initial estimation to fail. The binary search algorithm is designed to work reliably even when the prior information is very poor. The properties of the algorithm are studied in simulations and an advantage over the initial estimation with equally spaced factor levels is demonstrated. We also study the cost-efficiency of the binary search algorithm and find the approximately optimal number of measurements per stage when there is a cost related to the number of stages in the experiment.
KEY WORDS: optimal design, binary search, logistic regression, complementary log-log, quantum physics, switching measurement
Submitted 6 February, 2008; v1 submitted 1 November, 2006;
originally announced November 2006.
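The idea of searching for informative factor levels before fitting can be illustrated with a generic bisection. The sketch below is not the algorithm of the paper: it only relies on the standard fact that maximum likelihood estimates for a binary response model do not exist while all observed responses are 0 or all are 1, and it halves the factor range until one stage produces a mixed outcome. The simulated logistic response curve, its parameters `mu` and `scale`, and all function names are hypothetical.

```python
import math
import random


def simulate_responses(x, n, rng, mu=5.0, scale=0.2):
    """Hypothetical experiment: number of 1-responses out of n at factor level x."""
    z = (x - mu) / scale
    # Numerically stable logistic function.
    if z >= 0:
        p = 1.0 / (1.0 + math.exp(-z))
    else:
        p = math.exp(z) / (1.0 + math.exp(z))
    return sum(rng.random() < p for _ in range(n))


def bracket_transition(lo, hi, n_per_stage, rng, max_stages=30):
    """Bisect [lo, hi] until a stage yields both 0- and 1-responses.

    Assumes the response probability increases with x, is near 0 at lo
    and near 1 at hi. Returns (factor level, number of stages used).
    """
    for stage in range(1, max_stages + 1):
        mid = (lo + hi) / 2.0
        k = simulate_responses(mid, n_per_stage, rng)
        if k == 0:
            lo = mid          # all zeros: the transition lies above mid
        elif k == n_per_stage:
            hi = mid          # all ones: the transition lies below mid
        else:
            return mid, stage  # mixed outcome: a usable starting level
    return (lo + hi) / 2.0, max_stages
```

For example, `bracket_transition(0.0, 100.0, 20, random.Random(1))` starts from a range that is far too wide and ends at a level near the simulated transition at `mu = 5`, after which ordinary maximum likelihood fitting could take over.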
-
Experimental Designs for Binary Data in Switching Measurements on Superconducting Josephson Junctions
Authors:
Juha Karvanen,
Juha J. Vartiainen,
Andrey Timofeev,
Jukka Pekola
Abstract:
We study the optimal design of switching measurements of small Josephson junction circuits which operate in the macroscopic quantum tunnelling regime. Starting from the D-optimality criterion we derive the optimal design for the estimation of the unknown parameters of the underlying Gumbel type distribution. As a practical method for the measurements, we propose a sequential design that combines heuristic search for initial estimates and maximum likelihood estimation. The presented design has immediate applications in the area of superconducting electronics implying faster data acquisition. The presented experimental results confirm the usefulness of the method.
KEY WORDS: optimal design, D-optimality, logistic regression, complementary log-log link, quantum physics, escape measurements
Submitted 18 October, 2006;
originally announced October 2006.
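The D-optimality criterion mentioned in the abstract above selects the design that maximizes the determinant of the Fisher information matrix. As a minimal sketch, the code below uses a two-parameter logistic model rather than the Gumbel type model of the paper, fixes the true intercept at 0 and the slope at 1, and restricts attention to designs that put equal weight on the two points -z and +z. These simplifications and all names are assumptions made for illustration only.

```python
import math


def logistic_information_det(z):
    """Determinant of the information matrix for the design {-z, +z}.

    With linear predictor eta = a + b * x evaluated at a = 0, b = 1, each
    point contributes w(x) * [[1, x], [x, x^2]] with w = p * (1 - p).
    Equal weights on -z and +z give M = w(z) * [[1, 0], [0, z^2]].
    """
    p = 1.0 / (1.0 + math.exp(-z))
    w = p * (1.0 - p)
    return (w * w) * (z * z)


def d_optimal_symmetric_point(lo=0.1, hi=5.0, step=1e-4):
    """Grid search for the z that maximizes the determinant."""
    n = int(round((hi - lo) / step))
    best_z, best_det = lo, logistic_information_det(lo)
    for i in range(1, n + 1):
        z = lo + i * step
        d = logistic_information_det(z)
        if d > best_det:
            best_z, best_det = z, d
    return best_z


z_opt = d_optimal_symmetric_point()
p_opt = 1.0 / (1.0 + math.exp(-z_opt))
```

Under these assumptions the search lands near z = 1.54, that is, at response probabilities of roughly 0.18 and 0.82. For a complementary log-log link the weight function is not symmetric, so the optimal points differ from the logistic case.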