Search | arXiv e-print repository

Clustering and Pruning in Causal Data Fusion

Authors: Otto Tabell, Santtu Tikka, Juha Karvanen

Abstract: Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2411.03848 [pdf, ps, other]

Monotone Missing Data: A Blessing and a Curse

Authors: Santtu Tikka, Juha Karvanen

Abstract: Monotone missingness is commonly encountered in practice where a missing measurement compels another measurement to be missing. In graphical missing data models, monotonicity has implications for the identifiability of the full law, i.e., the joint distribution of actual variables and response indicators. In the general nonmonotone case, the full law is known to be nonparametrically identifiable if and only if neither colluders nor self-censoring edges are present in the graph. We show that monotonicity may enable the identification of the full law despite colluders and prevent the identification under mediated (pathwise) self-censoring. The results emphasize the importance of proper treatment of monotone missingness in the analysis of incomplete data.

Submitted 6 November, 2024; originally announced November 2024.
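
The notion of monotone missingness described in the abstract above (a missing measurement compelling another measurement to be missing) can be illustrated on response indicators. The following is a minimal sketch, assuming a known ordering of the variables and response indicators coded as 1 (observed) and 0 (missing); the function name `is_monotone` and the data layout are hypothetical and are not taken from the paper.

```python
def is_monotone(indicators, order):
    """Check whether a missingness pattern is monotone with respect to `order`.

    indicators: list of dicts mapping variable name -> 1 (observed) or 0 (missing).
    order: list of variable names; a missing value for an earlier variable
           must imply missing values for all later variables.
    """
    for row in indicators:
        seen_missing = False
        for var in order:
            if row[var] == 0:
                seen_missing = True
            elif seen_missing:
                # An observed value follows a missing one: not monotone.
                return False
    return True


# Usage: Y is missing whenever X is missing, so the pattern is monotone.
monotone_rows = [{"X": 1, "Y": 1}, {"X": 1, "Y": 0}, {"X": 0, "Y": 0}]
# Here Y is observed although X is missing, which breaks monotonicity.
nonmonotone_rows = [{"X": 1, "Y": 1}, {"X": 0, "Y": 1}]

print(is_monotone(monotone_rows, ["X", "Y"]))
print(is_monotone(nonmonotone_rows, ["X", "Y"]))
```

This only checks the observed pattern in a data set; it says nothing about identifiability of the full law, which depends on the graph.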

arXiv:2403.02245 [pdf, ps, other]

Dynamic programming principle in cost-efficient sequential design: application to switching measurements

Authors: Jeongmin Han, Juha Karvanen, Mikko Parviainen

Abstract: We study sequential cost-efficient design in a situation where each update of covariates involves a fixed time cost typically considerable compared to a single measurement time. The problem arises from parameter estimation in switching measurements on superconducting Josephson junctions which are components needed in quantum computers and other superconducting electronics. In switching measurements, a sequence of current pulses is applied to the junction and a binary voltage response is observed. The measurement requires a very low temperature that can be kept stable only for a relatively short time, and therefore it is essential to use an efficient design. We use the dynamic programming principle from the mathematical theory of optimal control to solve the optimal update times. Our simulations demonstrate the cost-efficiency compared to the previously used methods.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 28 pages, 3 figures

arXiv:2402.05633 [pdf, ps, other]

Full Law Identification under Missing Data with Categorical Variables

Authors: Santtu Tikka, Juha Karvanen

Abstract: Missing data may be disastrous for the identifiability of causal and statistical estimands. In graphical missing data models, colluders are dependence structures that have a special importance for identification considerations. It has been shown that the presence of a colluder makes the full law, i.e., the joint distribution of variables and response indicators, non-parametrically non-identifiable. However, when the variables related to the colluder structure are categorical, it is sometimes possible to regain the identifiability of the full law. We present a necessary and sufficient condition for the identification of the full law in the presence of colluder structures with arbitrary categorical variables. Maximum likelihood estimation of the full law in identifiable models with categorical variables is demonstrated with simulated and real data.

Submitted 3 July, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

arXiv:2306.15328 [pdf, ps, other]

doi 10.1613/jair.1.15579

Simulating counterfactuals

Authors: Juha Karvanen, Santtu Tikka, Matti Vihola

Abstract: Counterfactual inference considers a hypothetical intervention in a parallel world that shares some evidence with the factual world. If the evidence specifies a conditional distribution on a manifold, counterfactuals may be analytically intractable. We present an algorithm for simulating values from a counterfactual distribution where conditions can be set on both discrete and continuous variables. We show that the proposed algorithm can be presented as a particle filter leading to asymptotically valid inference. The algorithm is applied to fairness analysis in credit-scoring.

Submitted 26 March, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

Journal ref: Journal of Artificial Intelligence Research 80, 835-857, 2024

arXiv:2303.16660 [pdf, other]

doi 10.1353/obs.2024.a929116

Price Optimization Combining Conjoint Data and Purchase History: A Causal Modeling Approach

Authors: Lauri Valkonen, Santtu Tikka, Jouni Helske, Juha Karvanen

Abstract: Pricing decisions of companies require an understanding of the causal effect of a price change on the demand. When real-life pricing experiments are infeasible, data-driven decision-making must be based on alternative data sources such as purchase history (sales data) and conjoint studies where a group of customers is asked to make imaginary purchases in an artificial setup. We present an approach for price optimization that combines population statistics, purchase history and conjoint data in a systematic way. We build on the recent advances in causal inference to identify and quantify the effect of price on the purchase probability at the customer level. The identification task is a transportability problem whose solution requires a parametric assumption on the differences between the conjoint study and real purchases. The causal effect is estimated using Bayesian methods that take into account the uncertainty of the data sources. The pricing decision is made by comparing the estimated posterior distributions of gross profit for different prices. The approach is demonstrated with simulated data resembling the features of real-world data.

Submitted 30 April, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

Journal ref: Observational Studies, 10(1), 37-53, 2024

arXiv:2206.06699 [pdf, ps, other]

Generalizing experimental findings: identification beyond adjustments

Authors: Juha Karvanen

Abstract: We aim to generalize the results of a randomized controlled trial (RCT) to a target population with the help of some observational data. This is a problem of causal effect identification with multiple data sources. Challenges arise when the RCT is conducted in a context that differs from the target population. Earlier research has focused on cases where the estimates from the RCT can be adjusted by observational data in order to remove the selection bias and other domain-specific differences. We consider examples where the experimental findings cannot be generalized by an adjustment and show that the generalization may still be possible by other identification strategies that can be derived by applying do-calculus. The obtained identifying functionals for these examples contain trapdoor variables of a new type. The value of a trapdoor variable needs to be fixed in the estimation and the choice of the value may have a major effect on the bias and accuracy of estimates, which is also seen in simulations. The presented results expand the scope of settings where the generalization of experimental findings is doable.

Submitted 14 June, 2022; originally announced June 2022.

MSC Class: 62D20; 62H12; 62H22

arXiv:2111.15233 [pdf, other]

Contrasting Identifying Assumptions of Average Causal Effects: Robustness and Semiparametric Efficiency

Authors: Tetiana Gorbach, Xavier de Luna, Juha Karvanen, Ingeborg Waernbaum

Abstract: Semiparametric inference on average causal effects from observational data is based on assumptions yielding identification of the effects. In practice, several distinct identifying assumptions may be plausible; an analyst has to make a delicate choice between these models. In this paper, we study three identifying assumptions based on the potential outcome framework: the back-door assumption, which uses pre-treatment covariates, the front-door assumption, which uses mediators, and the two-door assumption using pre-treatment covariates and mediators simultaneously. We provide the efficient influence functions and the corresponding semiparametric efficiency bounds that hold under these assumptions, and their combinations. We demonstrate that none of the identification models provides uniformly the most efficient estimation and give conditions under which some bounds are lower than others. We show when semiparametric estimating equation estimators based on influence functions attain the bounds, and study the robustness of the estimators to misspecification of the nuisance models. The theory is complemented with simulation experiments on the finite sample behavior of the estimators. The results obtained are relevant for an analyst facing a choice between several plausible identifying assumptions and corresponding estimators. Our results show that this choice implies a trade-off between efficiency and robustness to misspecification of the nuisance models.

Submitted 17 February, 2023; v1 submitted 30 November, 2021; originally announced November 2021.

Journal ref: Journal of Machine Learning Research 24 (197), 1-65, 2023
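
The front-door assumption named in the abstract above identifies a causal effect through a mediator. As a rough illustration only, the sketch below evaluates the standard front-door formula, P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | x', m) P(x'), on a small discrete model and compares it with the true interventional probability. The model (an unobserved confounder U of X and Y, and a mediator M on the path X -> M -> Y), all probability values, and all function names are hypothetical choices made for this example; they are not taken from the paper, which works in the potential outcome framework and studies efficiency rather than this toy calculation.

```python
# Hypothetical binary model: U -> X, U -> Y (U unobserved), X -> M -> Y.
P_U = {0: 0.6, 1: 0.4}
P_X_GIVEN_U = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # P_X_GIVEN_U[u][x]
P_M_GIVEN_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P_M_GIVEN_X[x][m]
P_Y1_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | m, u)
VALUES = (0, 1)


def p_y_given_mu(y, m, u):
    p1 = P_Y1_GIVEN_MU[(m, u)]
    return p1 if y == 1 else 1.0 - p1


def joint(x, m, y):
    """Observed joint P(x, m, y), with the unobserved U summed out."""
    return sum(
        P_U[u] * P_X_GIVEN_U[u][x] * P_M_GIVEN_X[x][m] * p_y_given_mu(y, m, u)
        for u in VALUES
    )


def p_x(x):
    return sum(joint(x, m, y) for m in VALUES for y in VALUES)


def p_m_given_x(m, x):
    return sum(joint(x, m, y) for y in VALUES) / p_x(x)


def p_y_given_xm(y, x, m):
    return joint(x, m, y) / sum(joint(x, m, yy) for yy in VALUES)


def front_door(y, x):
    """Front-door formula, computed from the observed joint only."""
    return sum(
        p_m_given_x(m, x)
        * sum(p_y_given_xm(y, xp, m) * p_x(xp) for xp in VALUES)
        for m in VALUES
    )


def true_effect(y, x):
    """P(y | do(x)) computed directly from the full model, including U."""
    return sum(
        P_U[u] * P_M_GIVEN_X[x][m] * p_y_given_mu(y, m, u)
        for u in VALUES
        for m in VALUES
    )


def naive_conditional(y, x):
    """P(y | x), which is confounded by U in this model."""
    return sum(joint(x, m, y) for m in VALUES) / p_x(x)


for x in VALUES:
    print(x, front_door(1, x), true_effect(1, x), naive_conditional(1, x))
```

In this toy model the front-door value agrees with the true interventional probability, while the plain conditional probability does not, because U confounds X and Y.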

arXiv:2111.04513 [pdf, ps, other]

Clustering and Structural Robustness in Causal Diagrams

Authors: Santtu Tikka, Jouni Helske, Juha Karvanen

Abstract: Graphs are commonly used to represent and visualize causal relations. For a small number of variables, this approach provides a succinct and clear view of the scenario at hand. As the number of variables under study increases, the graphical approach may become impractical, and the clarity of the representation is lost. Clustering of variables is a natural way to reduce the size of the causal diagram, but it may erroneously change the essential properties of the causal relations if implemented arbitrarily. We define a specific type of cluster, called a transit cluster, that is guaranteed to preserve the identifiability properties of causal effects under certain conditions. We provide a sound and complete algorithm for finding all transit clusters in a given graph and demonstrate how clustering can simplify the identification of causal effects. We also study the inverse problem, where one starts with a clustered graph and looks for extended graphs where the identifiability properties of causal effects remain unchanged. We show that this kind of structural robustness is closely related to transit clusters.

Submitted 15 August, 2023; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: This is the version published in JMLR

Journal ref: Journal of Machine Learning Research, 24(195):1-32, 2023

arXiv:2008.13558 [pdf, other]

Simulation Framework for Realistic Large-scale Individual-level Data Generation with an Application in the Health Domain

Authors: Santtu Tikka, Jussi Hakanen, Mirka Saarela, Juha Karvanen

Abstract: We propose a framework for realistic data generation and simulation of complex systems and demonstrate its capabilities in the health domain. The main use cases of the framework are predicting the development of risk factors and disease occurrence, evaluating the impact of interventions and policy decisions, and statistical method development. We present the fundamentals of the framework using rigorous mathematical definitions. The framework supports calibration to a real population as well as various manipulations and data collection processes. The freely available open-source implementation in R embraces efficient data structures, parallel computing and fast random number generation which ensure reproducibility and scalability. With the framework it is possible to run daily-level simulations for populations of millions of individuals for decades of simulated time. An example on the occurrence of stroke, type 2 diabetes and mortality illustrates the usage of the framework in the Finnish context. In the example, we demonstrate the data-collection functionality by studying the impact of non-participation on the estimated risk models and interventions related to controlling the additional salt intake.

Submitted 5 June, 2021; v1 submitted 31 August, 2020; originally announced August 2020.

arXiv:2007.08189 [pdf, ps, other]

doi 10.1097/EDE.0000000000001270

Do-search -- a tool for causal inference and study design with multiple data sources

Authors: Juha Karvanen, Santtu Tikka, Antti Hyttinen

Abstract: Epidemiological evidence is based on multiple data sources including clinical trials, cohort studies, surveys, registries and expert opinions. Merging information from different sources opens up new possibilities for the estimation of causal effects. We show how causal effects can be identified and estimated by combining experiments and observations in real and realistic scenarios. As a new tool, we present do-search, a recently developed algorithmic approach that can determine the identifiability of a causal effect. The approach is based on do-calculus, and it can utilize data with non-trivial missing data and selection bias mechanisms. When the effect is identifiable, do-search outputs an identifying formula on which numerical estimation can be based. When the effect is not identifiable, we can use do-search to recognize additional data sources and assumptions that would make the effect identifiable. Throughout the paper, we consider the effect of salt-adding behavior on blood pressure mediated by the salt intake as an example. The identifiability of this effect is resolved in various scenarios with different assumptions on confounding. There are scenarios where the causal effect is identifiable from a chain of experiments but not from survey data, as well as scenarios where the opposite is true. As an illustration, we use survey data from NHANES 2013-2016 and the results from a meta-analysis of randomized controlled trials and estimate the reduction in average systolic blood pressure under an intervention where the use of table salt is discontinued.

Submitted 16 July, 2020; originally announced July 2020.

Journal ref: Epidemiology, 32(1), 111-119, 2020

arXiv:2003.03187 [pdf, other]

doi 10.1111/rssa.12699

Estimation of causal effects with small data in the presence of trapdoor variables

Authors: Jouni Helske, Santtu Tikka, Juha Karvanen

Abstract: We consider the problem of estimating causal effects of interventions from observational data when well-known back-door and front-door adjustments are not applicable. We show that when an identifiable causal effect is subject to an implicit functional constraint that is not deducible from conditional independence relations, the estimator of the causal effect can exhibit bias in small samples. This bias is related to variables that we call trapdoor variables. We use simulated data to study different strategies to account for trapdoor variables and suggest how the related trapdoor bias might be minimized. The importance of trapdoor variables in causal effect estimation is illustrated with real data from the Life Course 1971-2002 study. Using this dataset, we estimate the causal effect of education on income in the Finnish context. Bayesian modelling allows us to take the parameter uncertainty into account and to present the estimated causal effects as posterior distributions.

Submitted 24 March, 2021; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: 25 pages, 8 figures

Journal ref: Journal of Royal Statistical Society: Series A. 2021, 184:1030-1051

arXiv:1902.01073 [pdf, other]

doi 10.18637/jss.v099.i05

Causal Effect Identification from Multiple Incomplete Data Sources: A General Search-based Approach

Authors: Santtu Tikka, Antti Hyttinen, Juha Karvanen

Abstract: Causal effect identification considers whether an interventional probability distribution can be uniquely determined without parametric assumptions from measured source distributions and structural knowledge on the generating system. While complete graphical criteria and procedures exist for many identification problems, there are still challenging but important extensions that have not been considered in the literature. To tackle these new settings, we present a search algorithm directly over the rules of do-calculus. Due to the generality of do-calculus, the search is capable of taking more advanced data-generating mechanisms into account along with an arbitrary type of both observational and experimental source distributions. The search is enhanced via a heuristic and search space reduction techniques. The approach, called do-search, is provably sound, and it is complete with respect to identifiability problems that have been shown to be completely characterized by do-calculus. When extended with additional rules, the search is capable of handling missing data problems as well. With the versatile search, we are able to approach new problems such as combined transportability and selection bias, or multiple sources of selection bias. We perform a systematic analysis of bivariate missing data problems and study causal inference under case-control design. We also present the R package dosearch that provides an interface for a C++ implementation of the search.

Submitted 27 August, 2021; v1 submitted 4 February, 2019; originally announced February 2019.

Comments: This is the version published in the Journal of Statistical Software

Journal ref: Journal of Statistical Software, 99(5):1-40, 2021

arXiv:1806.07172 [pdf, ps, other]

doi 10.1016/j.ijar.2019.02.007

Surrogate Outcomes and Transportability

Authors: Santtu Tikka, Juha Karvanen

Abstract: Identification of causal effects is one of the most fundamental tasks of causal inference. We consider an identifiability problem where some experimental and observational data are available but neither data alone is sufficient for the identification of the causal effect of interest. Instead of the outcome of interest, surrogate outcomes are measured in the experiments. This problem is a generalization of identifiability using surrogate experiments and we label it as surrogate outcome identifiability. We show that the concept of transportability provides a sufficient criterion for determining surrogate outcome identifiability for a large class of queries.

Submitted 12 March, 2019; v1 submitted 19 June, 2018; originally announced June 2018.

Comments: This is the version published in the International Journal of Approximate Reasoning

Journal ref: International Journal of Approximate Reasoning, 2019; 108: 21-37

arXiv:1806.07161 [pdf, other]

doi 10.18637/jss.v076.i12

Identifying Causal Effects with the R Package causaleffect

Authors: Santtu Tikka, Juha Karvanen

Abstract: Do-calculus is concerned with estimating the interventional distribution of an action from the observed joint probability distribution of the variables in a given causal structure. All identifiable causal effects can be derived using the rules of do-calculus, but the rules themselves do not give any direct indication whether the effect in question is identifiable or not. Shpitser and Pearl constructed an algorithm for identifying joint interventional distributions in causal models, which contain unobserved variables and induce directed acyclic graphs. This algorithm can be seen as a repeated application of the rules of do-calculus and known properties of probabilities, and it ultimately either derives an expression for the causal distribution, or fails to identify the effect, in which case the effect is non-identifiable. In this paper, the R package causaleffect is presented, which provides an implementation of this algorithm. Functionality of causaleffect is also demonstrated through examples.

Submitted 19 June, 2018; originally announced June 2018.

Comments: This is the version published in the Journal of Statistical Software

Journal ref: Journal of Statistical Software, 76(12):1-30, 2017
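
For readers unfamiliar with the rules of do-calculus that the abstract above refers to, they can be stated as follows. This is the standard textbook formulation due to Pearl, given here as background and not quoted from the listed paper. The notation is an assumption of this note: X, Y, Z, W are disjoint sets of variables in a causal graph G, G with an overline on X denotes G with all edges into X removed, G with an underline on Z denotes G with all edges out of Z removed, and Z(W) denotes the nodes of Z that are not ancestors of any node of W in the graph with edges into X removed.

```latex
\begin{align*}
\text{Rule 1 (insertion/deletion of observations):}\quad
  & P(y \mid \mathrm{do}(x), z, w) = P(y \mid \mathrm{do}(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}}, \\
\text{Rule 2 (action/observation exchange):}\quad
  & P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), z, w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}}, \\
\text{Rule 3 (insertion/deletion of actions):}\quad
  & P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}.
\end{align*}
```

As the abstract notes, the rules alone do not say in which order to apply them, which is the gap an identification algorithm fills.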

arXiv:1806.07085 [pdf, ps, other]

Enhancing Identification of Causal Effects by Pruning

Authors: Santtu Tikka, Juha Karvanen

Abstract: Causal models communicate our assumptions about causes and effects in real-world phenomena. Often the interest lies in the identification of the effect of an action which means deriving an expression from the observed probability distribution for the interventional distribution resulting from the action. In many cases an identifiability algorithm may return a complicated expression that contains variables that are in fact unnecessary. In practice this can lead to additional computational burden and increased bias or inefficiency of estimates when dealing with measurement error or missing data. We present graphical criteria to detect variables which are redundant in identifying causal effects. We also provide an improved version of a well-known identifiability algorithm that implements these criteria.

Submitted 19 June, 2018; originally announced June 2018.

Comments: This is the version published in JMLR

Journal ref: Journal of Machine Learning Research (JMLR), 18(194):1-23, 2018

arXiv:1806.07082 [pdf, other]

Simplifying Probabilistic Expressions in Causal Inference

Authors: Santtu Tikka, Juha Karvanen

Abstract: Obtaining a non-parametric expression for an interventional distribution is one of the most fundamental tasks in causal inference. Such an expression can be obtained for an identifiable causal effect by an algorithm or by manual application of do-calculus. Often we are left with a complicated expression which can lead to biased or inefficient estimates when missing data or measurement errors are involved. We present an automatic simplification algorithm that seeks to eliminate symbolically unnecessary variables from these expressions by taking advantage of the structure of the underlying graphical model. Our method is applicable to all causal effect formulas and is readily available in the R package causaleffect.

Submitted 19 June, 2018; originally announced June 2018.

Comments: This is the version published in JMLR

Journal ref: Journal of Machine Learning Research (JMLR), 18(36):1-30, 2017

arXiv:1711.06070 [pdf, ps, other]

doi 10.1177/1403494817734774

Adjusting for selective non-participation with re-contact data in the FINRISK 2012 survey

Authors: Juho Kopra, Tommi Härkänen, Hanna Tolonen, Pekka Jousilahti, Kari Kuulasmaa, Jaakko Reinikainen, Juha Karvanen

Abstract: Aims: A common objective of epidemiological surveys is to provide population-level estimates of health indicators. Survey results tend to be biased under selective non-participation. One approach to bias reduction is to collect information about non-participants by contacting them again and asking them to fill in a questionnaire. This information is called re-contact data, and it allows the estimates to be adjusted for non-participation. Methods: We analyse data from the FINRISK 2012 survey, where re-contact data were collected. We assume that the respondents of the re-contact survey are similar to the remaining non-participants with respect to health given their available background information. Validity of this assumption is evaluated based on the hospitalization data obtained through record linkage of survey data to the administrative registers. Using this assumption and multiple imputation, we estimate the prevalences of daily smoking and heavy alcohol consumption and compare them to estimates obtained with a commonly used assumption that the participants represent the entire target group. Results: This approach produces higher prevalence estimates than what is estimated from participants only. Among men, smoking prevalence estimate was 28.5% (23.2% for participants), heavy alcohol consumption prevalence was 9.4% (6.8% for participants). Among women, smoking prevalence was 19.0% (16.5% for participants) and heavy alcohol consumption 4.8% (3.0% for participants). Conclusion: Utilization of re-contact data is a useful method to adjust for non-participation bias on population estimates in epidemiological surveys.

Submitted 16 November, 2017; originally announced November 2017.

Comments: 16 pages, 4 tables, 0 figures

Journal ref: Scandinavian Journal of Public Health, 2017

arXiv:1610.03687 [pdf, other]

Bayesian models for data missing not at random in health examination surveys

Authors: Juho Kopra, Juha Karvanen, Tommi Härkänen

Abstract: In epidemiological surveys, data missing not at random (MNAR) due to survey nonresponse may potentially lead to a bias in the risk factor estimates. We propose an approach based on Bayesian data augmentation and survival modelling to reduce the nonresponse bias. The approach requires additional information based on follow-up data. We present a case study of smoking prevalence using FINRISK data collected between 1972 and 2007 with a follow-up to the end of 2012 and compare it to other commonly applied missing at random (MAR) imputation approaches. A simulation experiment is carried out to study the validity of the approaches. Our approach appears to reduce the nonresponse bias substantially, whereas MAR imputation was not successful in bias reduction.

Submitted 28 August, 2017; v1 submitted 12 October, 2016; originally announced October 2016.

Comments: 19 pages, 2 figures

arXiv:1609.08347 [pdf, ps, other]

Optimal design of observational studies: overview and synthesis

Authors: Juha Karvanen, Jarno Vanhatalo, Kari Auranen, Sangita Kulathinal, Samu Mäntyniemi

Abstract: We review typical design problems encountered in the planning of observational studies and propose a unifying framework that allows us to use the same concepts and notation for different problems. In the framework, the design is defined as a probability measure in the space of observational processes that determine whether the value of a variable is observed for a specific unit at the given time. The optimal design is then defined, according to Bayesian decision theory, to be the one that maximizes the expected utility related to the design. We present examples on the use of the framework and discuss methods for deriving optimal or approximately optimal designs.

Submitted 1 November, 2017; v1 submitted 27 September, 2016; originally announced September 2016.

Comments: Submitted

arXiv:1609.01547 [pdf, ps, other]

doi 10.1111/stan.12264

Bayesian subcohort selection for longitudinal covariate measurements in follow-up studies

Authors: Jaakko Reinikainen, Juha Karvanen

Abstract: We consider planning longitudinal covariate measurements in follow-up studies where covariates are time-varying. We assume that the entire cohort cannot be selected for longitudinal measurements due to financial limitations and study how a subset of the cohort should be selected optimally in order to obtain precise estimates of covariate effects in a survival model. In our approach, the study will be designed sequentially utilizing the data collected in previous measurements of the individuals as prior information. We propose using a Bayesian optimality criterion in the subcohort selections, which is compared with simple random sampling using simulated and real follow-up data. This study extends previous results where optimal subcohort selection was studied with only one re-measurement and one covariate, to more realistic cases where several covariates and measurement points are allowed. Our results support the conclusion that the precision of the estimates can be clearly improved by optimal design.

Submitted 6 September, 2016; originally announced September 2016.

Journal ref: Statistica Neerlandica, 76(4), 372-390, 2022

arXiv:1608.02333 [pdf, ps, other]

doi 10.1002/bimj.201600067

Prioritizing covariates in the planning of future studies in the meta-analytic framework

Authors: Juha Karvanen, Mikko J. Sillanpää

Abstract: Science can be seen as a sequential process where each new study augments evidence to the existing knowledge. To have the best prospects to make an impact in this process, a new study should be designed optimally taking into account the previous studies and other prior information. We propose a formal approach for covariate prioritization, i.e., the decision about the covariates to be measured in a new study. The decision criteria can be based on conditional power, change of the p-value, change in lower confidence limit, Kullback-Leibler divergence, Bayes factors, Bayesian false discovery rate or difference between prior and posterior expectation. The criteria can also be used for decisions on the sample size. As an illustration, we consider covariate prioritization based on genome-wide association studies for C-reactive protein levels and make suggestions on the genes to be studied further. Keywords: design; evidence-based medicine; meta-analysis; power; scientific method

Submitted 8 August, 2016; originally announced August 2016.

Journal ref: Biometrical Journal, Volume 59, Issue 1, Pages 110-125, 2017

arXiv:1502.03609 [pdf, other]

doi 10.1002/sta4.73

Correcting for non-ignorable missingness in smoking trends

Authors: Juho Kopra, Tommi Härkänen, Hanna Tolonen, Juha Karvanen

Abstract: Data missing not at random (MNAR) is a major challenge in survey sampling. We propose an approach based on registry data to deal with non-ignorable missingness in health examination surveys. The approach relies on follow-up data available from administrative registers several years after the survey. For illustration we use data on smoking prevalence in the Finnish National FINRISK study conducted in 1972-1997. The data consist of measured survey information including missingness indicators, register-based background information and register-based time-to-disease survival data. The parameters of the missingness mechanism are estimable with these data although the original survey data are MNAR. The underlying data generation process is modelled by a Bayesian model. The results indicate that the estimated smoking prevalence rates in Finland may be significantly affected by missing data.

Submitted 12 February, 2015; originally announced February 2015.

Comments: in Stat, 2015

arXiv:1403.1124 [pdf, ps, other]

Estimating complex causal effects from incomplete observational data

Authors: Juha Karvanen

Abstract: Despite the major advances made in causal modeling, causality is still an unfamiliar topic for many statisticians. In this paper, it is demonstrated from the beginning to the end how causal effects can be estimated from observational data assuming that the causal structure is known. To make the problem more challenging, the causal effects are highly nonlinear and the data are missing at random. The tools used in the estimation include causal models with design, causal calculus, multiple imputation and generalized additive models. The main message is that a trained statistician can estimate causal effects by judiciously combining existing tools.

Submitted 2 July, 2014; v1 submitted 5 March, 2014; originally announced March 2014.

arXiv:1304.5380 [pdf, ps, other]

doi 10.1007/s11129-014-9148-4

Survey data and Bayesian analysis: a cost-efficient way to estimate customer equity

Authors: Juha Karvanen, Ari Rantanen, Lasse Luoma

Abstract: We present a Bayesian framework for estimating the customer lifetime value (CLV) and the customer equity (CE) based on the purchasing behavior deducible from the market surveys on customer purchasing behavior. The proposed framework systematically addresses the challenges faced when the future value of customers is estimated based on survey data. The scarcity of the survey data and the sampling variance are countered by utilizing the prior information and quantifying the uncertainty of the CE and CLV estimates by posterior distributions. Furthermore, information on the purchase behavior of the customers of competitors available in the survey data is integrated into the framework. The introduced approach is directly applicable in the domains where a customer relationship can be thought to be monogamous. As an example of the use of the framework, we analyze a consumer survey on mobile phones carried out in Finland in February 2013. The survey data contains consumer-given information on the current and previous brand of the phone and the times of the last two purchases.

Submitted 30 May, 2014; v1 submitted 19 April, 2013; originally announced April 2013.

MSC Class: 62N02; 62-07; 62F15 ACM Class: G.3; J.1

Journal ref: Quantitative Marketing and Economics, Volume 12, Issue 3, Pages 305-329, 2014

arXiv:1211.2958 [pdf, ps, other]

doi 10.1111/sjos.12110

Study design in causal models

Authors: Juha Karvanen

Abstract: The causal assumptions, the study design and the data are the elements required for scientific inference in empirical research. The research is adequately communicated only if all of these elements and their relations are described precisely. Causal models with design describe the study design and the missing data mechanism together with the causal structure and allow the direct application of causal calculus in the estimation of the causal effects. The flow of the study is visualized by ordering the nodes of the causal diagram in two dimensions by their causal order and the time of the observation. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph. Causal models with design offer a systematic and unifying view of scientific inference and increase the clarity and speed of communication. Examples of the causal models for a case-control study, a nested case-control study, a clinical trial and a two-stage case-cohort study are presented.

Submitted 24 April, 2014; v1 submitted 13 November, 2012; originally announced November 2012.

Comments: The example on the MORGAM Project is extended in this version

MSC Class: 62A01; 62-09; 62F99; 62D05; 62P10; 62K99; 68T30 ACM Class: G.3; G.2.2

Journal ref: Scandinavian Journal of Statistics, Volume 42, Issue 2, pages 361-377, 2015

arXiv:math/0701405 [pdf, ps, other]

doi 10.1016/j.csda.2007.06.021

Characterizing the generalized lambda distribution by L-moments

Authors: Juha Karvanen, Arto Nuutinen

Abstract: The generalized lambda distribution (GLD) is a flexible four-parameter distribution with many practical applications. L-moments of the GLD can be expressed in closed form and are good alternatives for the central moments. The L-moments of the GLD up to an arbitrary order are presented, and a study of L-skewness and L-kurtosis that can be achieved by the GLD is provided. The boundaries of L-skewness and L-kurtosis are derived analytically for the symmetric GLD and calculated numerically for the GLD in general. Additionally, the contours of L-skewness and L-kurtosis are presented as functions of the GLD parameters. It is found that with the exception of the smallest values of L-kurtosis, the GLD covers all possible pairs of L-skewness and L-kurtosis and often there are two or more distributions that share the same L-skewness and the same L-kurtosis. Examples that demonstrate situations where there are four GLD members with the same L-skewness and the same L-kurtosis are presented. The estimation of the GLD parameters is studied in a simulation example where the method of L-moments compares favorably to more complicated estimation methods. The results increase the knowledge on the distributions that belong to the GLD family and can be utilized in model selection and estimation.

Submitted 26 June, 2007; v1 submitted 15 January, 2007; originally announced January 2007.

Comments: Revised version, accepted for publication

MSC Class: 60E05; 62E10; 62G30

Journal ref: Computational Statistics & Data Analysis 2008, Vol. 52, 1971-1983

arXiv:math/0611017 [pdf, ps, other]

doi 10.1016/j.stamet.2007.11.002

Efficient initial designs for binary response data

Authors: Juha Karvanen

Abstract: In this paper we introduce a binary search algorithm that efficiently finds initial maximum likelihood estimates for sequential experiments where a binary response is modeled by a continuous factor. The problem is motivated by switching measurements on superconducting Josephson junctions. In this quantum mechanical experiment, the current is the factor controlled by the experimenter and a binary response indicating the presence or the absence of a voltage response is measured. The prior knowledge on the model parameters is typically poor, which may cause the common approaches of initial estimation to fail. The binary search algorithm is designed to work reliably even when the prior information is very poor. The properties of the algorithm are studied in simulations and an advantage over the initial estimation with equally spaced factor levels is demonstrated. We also study the cost-efficiency of the binary search algorithm and find the approximately optimal number of measurements per stage when there is a cost related to the number of stages in the experiment. KEY WORDS: optimal design, binary search, logistic regression, complementary log-log, quantum physics, switching measurement

Submitted 6 February, 2008; v1 submitted 1 November, 2006; originally announced November 2006.

MSC Class: 62L05; 62K05; 62P35

Journal ref: Statistical Methodology 2008, Vol. 5, 462-473

arXiv:cond-mat/0610507 [pdf, ps, other]

doi 10.1111/j.1467-9876.2007.00572.x

Experimental Designs for Binary Data in Switching Measurements on Superconducting Josephson Junctions

Authors: Juha Karvanen, Juha J. Vartiainen, Andrey Timofeev, Jukka Pekola

Abstract: We study the optimal design of switching measurements of small Josephson junction circuits which operate in the macroscopic quantum tunnelling regime. Starting from the D-optimality criterion we derive the optimal design for the estimation of the unknown parameters of the underlying Gumbel type distribution. As a practical method for the measurements, we propose a sequential design that combines heuristic search for initial estimates and maximum likelihood estimation. The presented design has immediate applications in the area of superconducting electronics implying faster data acquisition. The presented experimental results confirm the usefulness of the method. KEY WORDS: optimal design, D-optimality, logistic regression, complementary log-log link, quantum physics, escape measurements

Submitted 18 October, 2006; originally announced October 2006.

Journal ref: Journal of the Royal Statistical Society: Series C (Applied Statistics) 2007, Vol. 56, 167-181

Showing 1–29 of 29 results for author: Karvanen, J