Search | arXiv e-print repository

Generalized entropy calibration for analyzing voluntary survey data

Authors: Yonghyun Kwon, Jae Kwang Kim, Yumou Qiu

Abstract: Statistical analysis of voluntary survey data is an important area of research in survey sampling. We consider a unified approach to voluntary survey data analysis under the assumption that the sampling mechanism is ignorable. Generalized entropy calibration is introduced as a unified tool for calibration weighting to control the selection bias. We first establish the relationship between the gene… ▽ More Statistical analysis of voluntary survey data is an important area of research in survey sampling. We consider a unified approach to voluntary survey data analysis under the assumption that the sampling mechanism is ignorable. Generalized entropy calibration is introduced as a unified tool for calibration weighting to control the selection bias. We first establish the relationship between the generalized calibration weighting and its dual expression for regression estimation. The dual relationship is critical in identifying the implied regression model and developing model selection for calibration weighting. Also, if a linear regression model for an important study variable is available, then two-step calibration method can be used to smooth the final weights and achieve the statistical efficiency. Asymptotic properties of the proposed estimator are investigated. Results from a limited simulation study are also presented. △ Less

Submitted 1 June, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

arXiv:2404.01076 [pdf, other]

Debiased calibration estimation using generalized entropy in survey sampling

Authors: Yonghyun Kwon, Jae Kwang Kim, Yumou Qiu

Abstract: Incorporating the auxiliary information into the survey estimation is a fundamental problem in survey sampling. Calibration weighting is a popular tool for incorporating the auxiliary information. The calibration weighting method of Deville and Sarndal (1992) uses a distance measure between the design weights and the final weights to solve the optimization problem with calibration constraints. Thi… ▽ More Incorporating the auxiliary information into the survey estimation is a fundamental problem in survey sampling. Calibration weighting is a popular tool for incorporating the auxiliary information. The calibration weighting method of Deville and Sarndal (1992) uses a distance measure between the design weights and the final weights to solve the optimization problem with calibration constraints. This paper introduces a novel framework that leverages generalized entropy as the objective function for optimization, where design weights play a role in the constraints to ensure design consistency, rather than being part of the objective function. This innovative calibration framework is particularly attractive due to its generality and its ability to generate more efficient calibration weights compared to traditional methods based on Deville and Sarndal (1992). Furthermore, we identify the optimal choice of the generalized entropy function that achieves the minimum variance across various choices of the generalized entropy function under the same constraints. Asymptotic properties, such as design consistency and asymptotic normality, are presented rigorously. The results from a limited simulation study are also presented. We demonstrate a real-life application using agricultural survey data collected from Kynetec, Inc. △ Less

Submitted 11 May, 2025; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2401.07625 [pdf, ps, other]

Statistics in Survey Sampling

Authors: Jae Kwang Kim

Abstract: Survey sampling theory and methods are introduced. Sampling designs and estimation methods are carefully discussed as a textbook for survey sampling. Topics includes Horvitz-Thompson estimation, simple random sampling, stratified sampling, cluster sampling, ratio estimation, regression estimation, variance estimation, two-phase sampling, and nonresponse adjustment methods. Survey sampling theory and methods are introduced. Sampling designs and estimation methods are carefully discussed as a textbook for survey sampling. Topics includes Horvitz-Thompson estimation, simple random sampling, stratified sampling, cluster sampling, ratio estimation, regression estimation, variance estimation, two-phase sampling, and nonresponse adjustment methods. △ Less

Submitted 1 December, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

arXiv:2307.11651 [pdf, other]

Multiple bias-calibration for adjusting selection bias of non-probability samples using data integration

Authors: Zhonglei Wang, Shu Yang, Jae Kwang Kim

Abstract: Valid statistical inference is challenging when the sample is subject to unknown selection bias. Data integration can be used to correct for selection bias when we have a parallel probability sample from the same population with some common measurements. How to model and estimate the selection probability or the propensity score (PS) of a non-probability sample using an independent probability sam… ▽ More Valid statistical inference is challenging when the sample is subject to unknown selection bias. Data integration can be used to correct for selection bias when we have a parallel probability sample from the same population with some common measurements. How to model and estimate the selection probability or the propensity score (PS) of a non-probability sample using an independent probability sample is the challenging part of the data integration. We approach this difficult problem by employing multiple candidate models for PS combined with empirical likelihood. By incorporating multiple propensity score models into the internal bias calibration constraint in the empirical likelihood setup, the selection bias can be eliminated so long as the multiple candidate models contain a true PS model. The bias calibration constraint under the multiple PS models is called multiple bias calibration. Multiple PS models can include both missing-at-random and missing-not-at-random models. Asymptotic properties are discussed, and some limited simulation studies are presented to compare the proposed method with some existing competitors. Plasmode simulation studies using the Culture \& Community in a Time of Crisis dataset demonstrate the practical usage and advantages of the proposed method. △ Less

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2306.15173 [pdf, other]

Robust propensity score weighting estimation under missing at random

Authors: Hengfang Wang, Jae Kwang Kim, Jeongseop Han, Youngjo Lee

Abstract: Missing data is frequently encountered in many areas of statistics. Propensity score weighting is a popular method for handling missing data. The propensity score method employs a response propensity model, but correct specification of the statistical model can be challenging in the presence of missing data. Doubly robust estimation is attractive, as the consistency of the estimator is guaranteed… ▽ More Missing data is frequently encountered in many areas of statistics. Propensity score weighting is a popular method for handling missing data. The propensity score method employs a response propensity model, but correct specification of the statistical model can be challenging in the presence of missing data. Doubly robust estimation is attractive, as the consistency of the estimator is guaranteed when either the outcome regression model or the propensity score model is correctly specified. In this paper, we first employ information projection to develop an efficient and doubly robust estimator under indirect model calibration constraints. The resulting propensity score estimator can be equivalently expressed as a doubly robust regression imputation estimator by imposing the internal bias calibration condition in estimating the regression parameters. In addition, we generalize the information projection to allow for outlier-robust estimation. Some asymptotic properties are presented. The simulation study confirms that the proposed method allows robust inference against not only the violation of various model assumptions, but also outliers. A real-life application is presented using data from the Conservation Effects Assessment Project. △ Less

Submitted 27 March, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

arXiv:2211.02998 [pdf, ps, other]

An empirical likelihood approach to reduce selection bias in voluntary samples

Authors: Jae Kwang Kim, Kosuke Morikawa

Abstract: We address the weighting problem in voluntary samples under a nonignorable sample selection model. Under the assumption that the sample selection model is correctly specified, we can compute a consistent estimator of the model parameter and construct the propensity score estimator of the population mean. We use the empirical likelihood method to construct the final weights for voluntary samples by… ▽ More We address the weighting problem in voluntary samples under a nonignorable sample selection model. Under the assumption that the sample selection model is correctly specified, we can compute a consistent estimator of the model parameter and construct the propensity score estimator of the population mean. We use the empirical likelihood method to construct the final weights for voluntary samples by incorporating the bias calibration constraints and the benchmarking constraints. Linearization variance estimation of the proposed method is developed. A limited simulation study is also performed to check the performance of the proposed methods. △ Less

Submitted 11 May, 2023; v1 submitted 5 November, 2022; originally announced November 2022.

arXiv:2208.07535 [pdf, other]

Semiparametric imputation using latent sparse conditional Gaussian mixtures for multivariate mixed outcomes

Authors: Shonosuke Sugasawa, Jae Kwang Kim, Kosuke Morikawa

Abstract: This paper proposes a flexible Bayesian approach to multiple imputation using conditional Gaussian mixtures. We introduce novel shrinkage priors for covariate-dependent mixing proportions in the mixture models to automatically select the suitable number of components used in the imputation step. We develop an efficient sampling algorithm for posterior computation and multiple imputation via Markov… ▽ More This paper proposes a flexible Bayesian approach to multiple imputation using conditional Gaussian mixtures. We introduce novel shrinkage priors for covariate-dependent mixing proportions in the mixture models to automatically select the suitable number of components used in the imputation step. We develop an efficient sampling algorithm for posterior computation and multiple imputation via Markov Chain Monte Carlo methods. The proposed method can be easily extended to the situation where the data contains not only continuous variables but also discrete variables such as binary and count values. We also propose approximate Bayesian inference for parameters defined by loss functions based on posterior predictive distributing of missing observations, by extending bootstrap-based Bayesian inference for complete data. The proposed method is demonstrated through numerical studies using simulated and real data. △ Less

Submitted 16 August, 2022; originally announced August 2022.

Comments: 29 pages, 5 figures

arXiv:2208.06039 [pdf, other]

Semiparametric adaptive estimation under informative sampling

Authors: Kosuke Morikawa, Yoshikazu Terada, Jae Kwang Kim

Abstract: In survey sampling, survey data do not necessarily represent the target population, and the samples are often biased. However, information on the survey weights aids in the elimination of selection bias. The Horvitz-Thompson estimator is a well-known unbiased, consistent, and asymptotically normal estimator; however, it is not efficient. Thus, this study derives the semiparametric efficiency bound… ▽ More In survey sampling, survey data do not necessarily represent the target population, and the samples are often biased. However, information on the survey weights aids in the elimination of selection bias. The Horvitz-Thompson estimator is a well-known unbiased, consistent, and asymptotically normal estimator; however, it is not efficient. Thus, this study derives the semiparametric efficiency bound for various target parameters by considering the survey weight as a random variable and consequently proposes a semiparametric optimal estimator with certain working models on the survey weights. The proposed estimator is consistent, asymptotically normal, and efficient in a class of the regular and asymptotically linear estimators. Further, a limited simulation study is conducted to investigate the finite sample performance of the proposed method. The proposed method is applied to the 1999 Canadian Workplace and Employee Survey data. △ Less

Submitted 3 April, 2024; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: 21pages, 2 figures, and 2 tables

arXiv:2207.09891 [pdf, other]

Maximum Likelihood Imputation

Authors: Jeongseop Han, Youngjo Lee, Jae Kwang Kim

Abstract: Maximum likelihood (ML) estimation is widely used in statistics. The h-likelihood has been proposed as an extension of Fisher's likelihood to statistical models including unobserved latent variables of recent interest. Its advantage is that the joint maximization gives ML estimators (MLEs) of both fixed and random parameters with their standard error estimates. However, the current h-likelihood ap… ▽ More Maximum likelihood (ML) estimation is widely used in statistics. The h-likelihood has been proposed as an extension of Fisher's likelihood to statistical models including unobserved latent variables of recent interest. Its advantage is that the joint maximization gives ML estimators (MLEs) of both fixed and random parameters with their standard error estimates. However, the current h-likelihood approach does not allow MLEs of variance components as Henderson's joint likelihood does not in linear mixed models. In this paper, we show how to form the h-likelihood in order to facilitate joint maximization for MLEs of whole parameters. We also show the role of the Jacobian term which allows MLEs in the presence of unobserved latent variables. To obtain MLEs for fixed parameters, intractable integration is not necessary. As an illustration, we show one-shot ML imputation for missing data by treating them as realized but unobserved random parameters. We show that the h-likelihood bypasses the expectation step in the expectation-maximization (EM) algorithm and allows single ML imputation instead of multiple imputations. We also discuss the difference in predictions in random effects and missing data. △ Less

Submitted 20 July, 2022; originally announced July 2022.

arXiv:2206.01084 [pdf, other]

doi 10.1093/biomet/asad016

Soft calibration for selection bias problems under mixed-effects models

Authors: Chenyin Gao, Shu Yang, Jae Kwang Kim

Abstract: Calibration weighting has been widely used to correct selection biases in non-probability sampling, missing data, and causal inference. The main idea is to calibrate the biased sample to the benchmark by adjusting the subject weights. However, hard calibration can produce enormous weights when an exact calibration is enforced on a large set of extraneous covariates. This article proposes a soft ca… ▽ More Calibration weighting has been widely used to correct selection biases in non-probability sampling, missing data, and causal inference. The main idea is to calibrate the biased sample to the benchmark by adjusting the subject weights. However, hard calibration can produce enormous weights when an exact calibration is enforced on a large set of extraneous covariates. This article proposes a soft calibration scheme, in which the outcome and the selection indicator follow mixed-effects models. The scheme imposes an exact calibration on the fixed effects and an approximate calibration on the random effects. On the one hand, our soft calibration has an intrinsic connection with best linear unbiased prediction, which results in a more efficient estimation compared to hard calibration. On the other hand, soft calibration weighting estimation can be envisioned as penalized propensity score weight estimation, with the penalty term motivated by the mixed-effects structure. The asymptotic distribution and a valid variance estimator are derived for soft calibration. We demonstrate the superiority of the proposed estimator over other competitors in simulation studies and a real-data application. △ Less

Submitted 22 February, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

Comments: Accepted for publication in Biometrika

arXiv:2204.09193 [pdf, other]

Functional Calibration under Non-Probability Survey Sampling

Authors: Zhonglei Wang, Xiaojun Mao, Jae Kwang Kim

Abstract: Non-probability sampling is prevailing in survey sampling, but ignoring its selection bias leads to erroneous inferences. We offer a unified nonparametric calibration method to estimate the sampling weights for a non-probability sample by calibrating functions of auxiliary variables in a reproducing kernel Hilbert space. The consistency and the limiting distribution of the proposed estimator are e… ▽ More Non-probability sampling is prevailing in survey sampling, but ignoring its selection bias leads to erroneous inferences. We offer a unified nonparametric calibration method to estimate the sampling weights for a non-probability sample by calibrating functions of auxiliary variables in a reproducing kernel Hilbert space. The consistency and the limiting distribution of the proposed estimator are established, and the corresponding variance estimator is also investigated. Compared with existing works, the proposed method is more robust since no parametric assumption is made for the selection mechanism of the non-probability sample. Numerical results demonstrate that the proposed method outperforms its competitors, especially when the model is misspecified. The proposed method is applied to analyze the average total cholesterol of Korean citizens based on a non-probability sample from the National Health Insurance Sharing Service and a reference probability sample from the Korea National Health and Nutrition Examination Survey. △ Less

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2202.11276 [pdf, other]

doi 10.1111/rssa.12841

Nearest neighbor ratio imputation with incomplete multi-nomial outcome in survey sampling

Authors: Chenyin Gao, Katherine Jenny Thompson, Shu Yang, Jae Kwang Kim

Abstract: Nonresponse is a common problem in survey sampling. Appropriate treatment can be challenging, especially when dealing with detailed breakdowns of totals. Often, the nearest neighbor imputation method is used to handle such incomplete multinomial data. In this article, we investigate the nearest neighbor ratio imputation estimator, in which auxiliary variables are used to identify the closest donor… ▽ More Nonresponse is a common problem in survey sampling. Appropriate treatment can be challenging, especially when dealing with detailed breakdowns of totals. Often, the nearest neighbor imputation method is used to handle such incomplete multinomial data. In this article, we investigate the nearest neighbor ratio imputation estimator, in which auxiliary variables are used to identify the closest donor and the vector of proportions from the donor is applied to the total of the recipient to implement ratio imputation. To estimate the asymptotic variance, we first treat the nearest neighbor ratio imputation as a special case of predictive matching imputation and apply the linearization method of \cite{yang2020asymptotic}. To account for the non-negligible sampling fractions, parametric and generalized additive models are employed to incorporate the smoothness of the imputation estimator, which results in a valid variance estimator. We apply the proposed method to estimate expenditures detail items based on empirical data from the 2018 collection of the Service Annual Survey, conducted by the United States Census Bureau. Our simulation results demonstrate the validity of our proposed estimators and also confirm that the derived variance estimators have good performance even when the sampling fraction is non-negligible. △ Less

Submitted 22 February, 2022; originally announced February 2022.

Comments: Accepted for publication in JRSS(A)

arXiv:2107.07371 [pdf, other]

Statistical inference using Regularized M-estimation in the reproducing kernel Hilbert space for handling missing data

Authors: Hengfang Wang, Jae Kwang Kim

Abstract: Imputation and propensity score weighting are two popular techniques for handling missing data. We address these problems using the regularized M-estimation techniques in the reproducing kernel Hilbert space. Specifically, we first use the kernel ridge regression to develop imputation for handling item nonresponse. While this nonparametric approach is potentially promising for imputation, its stat… ▽ More Imputation and propensity score weighting are two popular techniques for handling missing data. We address these problems using the regularized M-estimation techniques in the reproducing kernel Hilbert space. Specifically, we first use the kernel ridge regression to develop imputation for handling item nonresponse. While this nonparametric approach is potentially promising for imputation, its statistical properties are not investigated in the literature. Under some conditions on the order of the tuning parameter, we first establish the root-$n$ consistency of the kernel ridge regression imputation estimator and show that it achieves the lower bound of the semiparametric asymptotic variance. A nonparametric propensity score estimator using the reproducing kernel Hilbert space is also developed by a novel application of the maximum entropy method for the density ratio function estimation. We show that the resulting propensity score estimator is asymptotically equivalent to the kernel ridge regression imputation estimator. Results from a limited simulation study are also presented to confirm our theory. The proposed method is applied to analyze the air pollution data measured in Beijing, China. △ Less

Submitted 15 July, 2021; originally announced July 2021.

Comments: arXiv admin note: text overlap with arXiv:2102.00058

arXiv:2107.06448 [pdf, other]

Survey data integration for regression analysis using model calibration

Authors: Zhonglei Wang, Hang J. Kim, Jae Kwang Kim

Abstract: We consider regression analysis in the context of data integration. To combine partial information from external sources, we employ the idea of model calibration which introduces a "working" reduced model based on the observed covariates. The working reduced model is not necessarily correctly specified but can be a useful device to incorporate the partial information from the external data. The ac… ▽ More We consider regression analysis in the context of data integration. To combine partial information from external sources, we employ the idea of model calibration which introduces a "working" reduced model based on the observed covariates. The working reduced model is not necessarily correctly specified but can be a useful device to incorporate the partial information from the external data. The actual implementation is based on a novel application of the information projection and model calibration weighting. The proposed method is particularly attractive for combining information from several sources with different missing patterns. The proposed method is applied to a real data example combining survey data from Korean National Health and Nutrition Examination Survey and big data from National Health Insurance Sharing Service in Korea. △ Less

Submitted 11 October, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

arXiv:2104.13469 [pdf, other]

Information projection approach to propensity score estimation for handling selection bias under missing at random

Authors: Hengfang Wang, Jae Kwang Kim

Abstract: Propensity score weighting is widely used to improve the representativeness and correct the selection bias in the voluntary sample. The propensity score is often developed using a model for the sampling probability, which can be subject to model misspecification. In this paper, we consider an alternative approach of estimating the inverse of the propensity scores using the density ratio function s… ▽ More Propensity score weighting is widely used to improve the representativeness and correct the selection bias in the voluntary sample. The propensity score is often developed using a model for the sampling probability, which can be subject to model misspecification. In this paper, we consider an alternative approach of estimating the inverse of the propensity scores using the density ratio function satisfying the self-efficiency condition. The smoothed density ratio function is obtained by the solution to the information projection onto the space satisfying the moment conditions on the balancing scores. By including the covariates for the outcome regression models only in the density ratio model, we can achieve efficient propensity score estimation. Penalized regression is used to identify important covariates. We further extend the proposed approach to the multivariate missing case. Some limited simulation studies are presented to compare with the existing methods. △ Less

Submitted 19 July, 2022; v1 submitted 27 April, 2021; originally announced April 2021.

arXiv:2011.05988 [pdf, other]

Maximum sampled conditional likelihood for informative subsampling

Authors: HaiYing Wang, Jae Kwang Kim

Abstract: Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited. After a subsample is taken from the full data, most available methods use an inverse probability weighted (IPW) objective function to estimate the model parameters. The IPW estimator does not fully utilize the information in the selected subsample. In this paper,… ▽ More Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited. After a subsample is taken from the full data, most available methods use an inverse probability weighted (IPW) objective function to estimate the model parameters. The IPW estimator does not fully utilize the information in the selected subsample. In this paper, we propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data. We established the asymptotic normality of the MSCLE and prove that its asymptotic variance covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the IPW estimator. We further discuss the asymptotic results with the L-optimal subsampling probabilities and illustrate the estimation procedure with generalized linear models. Numerical experiments are provided to evaluate the practical performance of the proposed method. △ Less

Submitted 9 October, 2022; v1 submitted 11 November, 2020; originally announced November 2020.

arXiv:2001.03259 [pdf, ps, other]

Statistical Data Integration in Survey Sampling: A Review

Authors: Shu Yang, Jae Kwang Kim

Abstract: Finite population inference is a central goal in survey sampling. Probability sampling is the main statistical approach to finite population inference. Challenges arise due to high cost and increasing non-response rates. Data integration provides a timely solution by leveraging multiple data sources to provide more robust and efficient inference than using any single data source alone. The techniq… ▽ More Finite population inference is a central goal in survey sampling. Probability sampling is the main statistical approach to finite population inference. Challenges arise due to high cost and increasing non-response rates. Data integration provides a timely solution by leveraging multiple data sources to provide more robust and efficient inference than using any single data source alone. The technique for data integration varies depending on types of samples and available information to be combined. This article provides a systematic review of data integration techniques for combining probability samples, probability and non-probability samples, and probability and big data samples. We discuss a wide range of integration methods such as generalized least squares, calibration weighting, inverse probability weighting, mass imputation and doubly robust methods. Finally, we highlight important questions for future research. △ Less

Submitted 9 January, 2020; originally announced January 2020.

Comments: Submitted to Japanese Journal of Statistics and Data Science

arXiv:1909.06534 [pdf, other]

Semiparametric Imputation Using Conditional Gaussian Mixture Models under Item Nonresponse

Authors: Danhyang Lee, Jae Kwang Kim

Abstract: Imputation is a popular technique for handling item nonresponse in survey sampling. Parametric imputation is based on a parametric model for imputation and is less robust against the failure of the imputation model. Nonparametric imputation is fully robust but is not applicable when the dimension of covariates is large due to the curse of dimensionality. Semiparametric imputation is another robust… ▽ More Imputation is a popular technique for handling item nonresponse in survey sampling. Parametric imputation is based on a parametric model for imputation and is less robust against the failure of the imputation model. Nonparametric imputation is fully robust but is not applicable when the dimension of covariates is large due to the curse of dimensionality. Semiparametric imputation is another robust imputation based on a flexible model where the number of model parameters can increase with the sample size. In this paper, we propose another semiparametric imputation based on a more flexible model assumption than the Gaussian mixture model. In the proposed mixture model, we assume a conditional Gaussian model for the study variable given the auxiliary variables, but the marginal distribution of the auxiliary variables is not necessarily Gaussian. We show that the proposed mixture model achieves a lower approximation error bound to any unknown target density than the Gaussian mixture model in terms of the Kullback-Leibler divergence. The proposed method is applicable to high dimensional covariate problem by including a penalty function in the conditional log-likelihood function. The proposed method is applied to 2017 Korean Household Income and Expenditure Survey conducted by Statistics Korea. Supplementary material is available online. △ Less

Submitted 19 September, 2019; v1 submitted 14 September, 2019; originally announced September 2019.

arXiv:1906.04398 [pdf, other]

An Approximate Bayesian Approach to Model-assisted Survey Estimation with Many Auxiliary Variables

Authors: Shonosuke Sugasawa, Jae Kwang Kim

Abstract: Model-assisted estimation with complex survey data is an important practical problem in survey sampling. When there are many auxiliary variables, selecting significant variables associated with the study variable would be necessary to achieve efficient estimation of population parameters of interest. In this paper, we formulate a regularized regression estimator in the framework of Bayesian infere… ▽ More Model-assisted estimation with complex survey data is an important practical problem in survey sampling. When there are many auxiliary variables, selecting significant variables associated with the study variable would be necessary to achieve efficient estimation of population parameters of interest. In this paper, we formulate a regularized regression estimator in the framework of Bayesian inference using the penalty function as the shrinkage prior for model selection. The proposed Bayesian approach enables us to get not only efficient point estimates but also reasonable credible intervals. Results from two limited simulation studies are presented to facilitate comparison with existing frequentist methods. △ Less

Submitted 31 March, 2020; v1 submitted 11 June, 2019; originally announced June 2019.

Comments: 37 pages

arXiv:1903.05212 [pdf, ps, other]

Doubly Robust Inference when Combining Probability and Non-probability Samples with High-dimensional Data

Authors: Shu Yang, Jae Kwang Kim, Rui Song

Abstract: Non-probability samples become increasingly popular in survey statistics but may suffer from selection biases that limit the generalizability of results to the target population. We consider integrating a non-probability sample with a probability sample which provides high-dimensional representative covariate information of the target population. We propose a two-step approach for variable selecti… ▽ More Non-probability samples become increasingly popular in survey statistics but may suffer from selection biases that limit the generalizability of results to the target population. We consider integrating a non-probability sample with a probability sample which provides high-dimensional representative covariate information of the target population. We propose a two-step approach for variable selection and finite population inference. In the first step, we use penalized estimating equations with folded-concave penalties to select important variables for the sampling score of selection into the non-probability sample and the outcome model. We show that the penalized estimating equation approach enjoys the selection consistency property for general probability samples. The major technical hurdle is due to the possible dependence of the sample under the finite population framework. To overcome this challenge, we construct martingales which enable us to apply Bernstein concentration inequality for martingales. In the second step, we focus on a doubly robust estimator of the finite population mean and re-estimate the nuisance model parameters by minimizing the asymptotic squared bias of the doubly robust estimator. This estimating strategy mitigates the possible first-step selection error and renders the doubly robust estimator root-n consistent if either the sampling probability or the outcome model is correctly specified. △ Less

Submitted 23 August, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

arXiv:1903.03630 [pdf, ps, other]

Imputation estimators for unnormalized models with missing data

Authors: Masatoshi Uehara, Takeru Matsuda, Jae Kwang Kim

Abstract: Several statistical models are given in the form of unnormalized densities, and calculation of the normalization constant is intractable. We propose estimation methods for such unnormalized models with missing data. The key concept is to combine imputation techniques with estimators for unnormalized models including noise contrastive estimation and score matching. In addition, we derive asymptotic… ▽ More Several statistical models are given in the form of unnormalized densities, and calculation of the normalization constant is intractable. We propose estimation methods for such unnormalized models with missing data. The key concept is to combine imputation techniques with estimators for unnormalized models including noise contrastive estimation and score matching. In addition, we derive asymptotic distributions of the proposed estimators and construct confidence intervals. Simulation results with truncated Gaussian graphical models and the application to real data of wind direction reveal that the proposed methods effectively enable statistical inference with unnormalized models from missing data. △ Less

Submitted 8 June, 2020; v1 submitted 8 March, 2019; originally announced March 2019.

Comments: To appear (AISTATS 2020)

arXiv:1901.01645 [pdf, ps, other]

Bootstrap inference for the finite population total under complex sampling designs

Authors: Zhonglei Wang, Jae Kwang Kim, Liuhua Peng

Abstract: Bootstrap is a useful tool for making statistical inference, but it may provide erroneous results under complex survey sampling. Most studies about bootstrap-based inference are developed under simple random sampling and stratified random sampling. In this paper, we propose a unified bootstrap method applicable to some complex sampling designs, including Poisson sampling and probability-proportion… ▽ More Bootstrap is a useful tool for making statistical inference, but it may provide erroneous results under complex survey sampling. Most studies about bootstrap-based inference are developed under simple random sampling and stratified random sampling. In this paper, we propose a unified bootstrap method applicable to some complex sampling designs, including Poisson sampling and probability-proportional-to-size sampling. Two main features of the proposed bootstrap method are that studentization is used to make inference, and the finite population is bootstrapped based on a multinomial distribution by incorporating the sampling information. We show that the proposed bootstrap method is second-order accurate using the Edgeworth expansion. Two simulation studies are conducted to compare the proposed bootstrap method with the Wald-type method, which is widely used in survey sampling. Results show that the proposed bootstrap method is better in terms of coverage rate especially when sample size is limited. △ Less

Submitted 6 January, 2019; originally announced January 2019.

arXiv:1812.10694 [pdf, ps, other]

Combining Non-probability and Probability Survey Samples Through Mass Imputation

Authors: Jae Kwang Kim, Seho Park, Yilin Chen, Changbao Wu

Abstract: This paper presents theoretical results on combining non-probability and probability survey samples through mass imputation, an approach originally proposed by Rivers (2007) as sample matching without rigorous theoretical justification. Under suitable regularity conditions, we establish the consistency of the mass imputation estimator and derive its asymptotic variance formula. Variance estimators… ▽ More This paper presents theoretical results on combining non-probability and probability survey samples through mass imputation, an approach originally proposed by Rivers (2007) as sample matching without rigorous theoretical justification. Under suitable regularity conditions, we establish the consistency of the mass imputation estimator and derive its asymptotic variance formula. Variance estimators are developed using either linearization or bootstrap. Finite sample performances of the mass imputation estimator are investigated through simulation studies and an application to analyzing a non-probability sample collected by the Pew Research Centre. △ Less

Submitted 21 November, 2020; v1 submitted 27 December, 2018; originally announced December 2018.

Comments: Submitted to Journal of the Royal Statistical Society: Series A

arXiv:1811.11950 [pdf, other]

Accounting for model uncertainty in multiple imputation under complex sampling

Authors: Gyuhyeong Goh, Jae Kwang Kim

Abstract: Multiple imputation provides an effective way to handle missing data. When several possible models are under consideration for the data, the multiple imputation is typically performed under a single-best model selected from the candidate models. This single model selection approach ignores the uncertainty associated with the model selection and so leads to underestimation of the variance of multip… ▽ More Multiple imputation provides an effective way to handle missing data. When several possible models are under consideration for the data, the multiple imputation is typically performed under a single-best model selected from the candidate models. This single model selection approach ignores the uncertainty associated with the model selection and so leads to underestimation of the variance of multiple imputation estimator. In this paper, we propose a new multiple imputation procedure incorporating model uncertainty in the final inference. The proposed method incorporates possible candidate models for the data into the imputation procedure using the idea of Bayesian Model Averaging (BMA). The proposed method is directly applicable to handling item nonresponse in survey sampling. Asymptotic properties of the proposed method are investigated. A limited simulation study confirms that our model averaging approach provides better estimation performance than the single model selection approach. △ Less

Submitted 28 November, 2018; originally announced November 2018.

Comments: 23 pages, 1 Table

arXiv:1810.12519 [pdf, ps, other]

Semiparametric response model with nonignorable nonresponse

Authors: Masatoshi Uehara, Jae Kwang Kim

Abstract: How to deal with nonignorable response is often a challenging problem encountered in statistical analysis with missing data. Parametric model assumption for the response mechanism is often made and there is no way to validate the model assumption with missing data. We consider a semiparametric response model that relaxes the parametric model assumption in the response mechanism. Two types of effic… ▽ More How to deal with nonignorable response is often a challenging problem encountered in statistical analysis with missing data. Parametric model assumption for the response mechanism is often made and there is no way to validate the model assumption with missing data. We consider a semiparametric response model that relaxes the parametric model assumption in the response mechanism. Two types of efficient estimators, profile maximum likelihood estimator and profile calibration estimator, are proposed and their asymptotic properties are investigated. Two extensive simulation studies are used to compare with some existing methods. We present an application of our method using Korean Labor and Income Panel Survey data. △ Less

Submitted 30 October, 2018; originally announced October 2018.

arXiv:1809.05976 [pdf, ps, other]

Semiparametric fractional imputation using Gaussian mixture models for handling multivariate missing data

Authors: Hejian Sang, Jae Kwang Kim

Abstract: Item nonresponse is frequently encountered in practice. Ignoring missing data can lose efficiency and lead to misleading inference. Fractional imputation is a frequentist approach of imputation for handling missing data. However, the parametric fractional imputation of \cite{kim2011parametric} may be subject to bias under model misspecification. In this paper, we propose a novel semiparametric fra… ▽ More Item nonresponse is frequently encountered in practice. Ignoring missing data can lose efficiency and lead to misleading inference. Fractional imputation is a frequentist approach of imputation for handling missing data. However, the parametric fractional imputation of \cite{kim2011parametric} may be subject to bias under model misspecification. In this paper, we propose a novel semiparametric fractional imputation method using Gaussian mixture models. The proposed method is computationally efficient and leads to robust estimation. The proposed method is further extended to incorporate the categorical auxiliary information. The asymptotic model consistency and $\sqrt{n}$- consistency of the semiparametric fractional imputation estimator are also established. Some simulation studies are presented to check the finite sample performance of the proposed method. △ Less

Submitted 16 September, 2018; originally announced September 2018.

Comments: 23 pages, 2 figures

arXiv:1807.10873 [pdf, other]

Bayesian Sparse Propensity Score Estimation for Unit Nonresponse

Authors: Hejian Sang, Gyuhyeong Goh, Jae Kwang Kim

Abstract: Nonresponse weighting adjustment using propensity score is a popular method for handling unit nonresponse. However, including all available auxiliary variables into the propensity model can lead to inefficient and inconsistent estimation, especially with high-dimensional covariates. In this paper, a new Bayesian method using the Spike-and-Slab prior is proposed for sparse propensity score estimati… ▽ More Nonresponse weighting adjustment using propensity score is a popular method for handling unit nonresponse. However, including all available auxiliary variables into the propensity model can lead to inefficient and inconsistent estimation, especially with high-dimensional covariates. In this paper, a new Bayesian method using the Spike-and-Slab prior is proposed for sparse propensity score estimation. The proposed method is not based on any model assumption on the outcome variable and is computationally efficient. Instead of doing model selection and parameter estimation separately as in many frequentist methods, the proposed method simultaneously selects the sparse response probability model and provides consistent parameter estimation. Some asymptotic properties of the proposed method are presented. The efficiency of this sparse propensity score estimator is further improved by incorporating related auxiliary variables from the full sample. The finite-sample performance of the proposed method is investigated in two limited simulation studies, including a partially simulated real data example from the Korean Labor and Income Panel Survey. △ Less

Submitted 27 July, 2018; originally announced July 2018.

Comments: 38 pages, 3 tables

arXiv:1807.02817 [pdf, ps, other]

Integration of survey data and big observational data for finite population inference using mass imputation

Authors: Shu Yang, Jae Kwang Kim

Abstract: Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining a probability sample with big observational data. Unlike the usual imputation for missing data analysis, we create imputed values for the whole elements in the probability sample. Such mass… ▽ More Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining a probability sample with big observational data. Unlike the usual imputation for missing data analysis, we create imputed values for the whole elements in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate the proposed estimators outperform existing competitors in terms of robustness and efficiency. △ Less

Submitted 8 July, 2018; originally announced July 2018.

arXiv:1801.09728 [pdf, ps, other]

doi 10.1111/insr.12290

Sampling techniques for big data analysis in finite population inference

Authors: Jae Kwang Kim, Zhonglei Wang

Abstract: In analyzing big data for finite population inference, it is critical to adjust for the selection bias in the big data. In this paper, we propose two methods of reducing the selection bias associated with the big data sample. The first method uses a version of inverse sampling by incorporating auxiliary information from external sources, and the second one borrows the idea of data integration by c… ▽ More In analyzing big data for finite population inference, it is critical to adjust for the selection bias in the big data. In this paper, we propose two methods of reducing the selection bias associated with the big data sample. The first method uses a version of inverse sampling by incorporating auxiliary information from external sources, and the second one borrows the idea of data integration by combining the big data sample with an independent probability sample. Two simulation studies show that the proposed methods are unbiased and have better coverage rates than their alternatives. In addition, the proposed methods are easy to implement in practice. △ Less

Submitted 29 January, 2018; originally announced January 2018.

Comments: 24 pages, 3 tables

arXiv:1707.00974 [pdf, ps, other]

Nearest neighbor imputation for general parameter estimation in survey sampling

Authors: Shu Yang, Jae Kwang Kim

Abstract: Nearest neighbor imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the nearest neighbor imputation estimator for general population parameters, including population means, proportions and quantiles. For variance estimation, the conventional bootstrap inference for matching estimators with fixed number of matches has been… ▽ More Nearest neighbor imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the nearest neighbor imputation estimator for general population parameters, including population means, proportions and quantiles. For variance estimation, the conventional bootstrap inference for matching estimators with fixed number of matches has been shown to be invalid due to the nonsmoothness nature of the matching estimator. We propose asymptotically valid replication variance estimation. The key strategy is to construct replicates of the estimator directly based on linear terms, instead of individual records of variables. A simulation study confirms that the new procedure provides valid variance estimation. △ Less

Submitted 30 June, 2017; originally announced July 2017.

Comments: 25 pages. arXiv admin note: substantial text overlap with arXiv:1703.10256

arXiv:1703.10256 [pdf, ps, other]

Predictive mean matching imputation in survey sampling

Authors: Shu Yang, Jae Kwang Kim

Abstract: Predictive mean matching imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the predictive mean matching estimator of the population mean. For variance estimation, the conventional bootstrap inference for matching estimators with fixed matches has been shown to be invalid due to the nonsmoothness nature of the matching est… ▽ More Predictive mean matching imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the predictive mean matching estimator of the population mean. For variance estimation, the conventional bootstrap inference for matching estimators with fixed matches has been shown to be invalid due to the nonsmoothness nature of the matching estimator. We propose asymptotically valid replication variance estimation. The key strategy is to construct replicates of the estimator directly based on linear terms, instead of individual records of variables. Extension to nearest neighbor imputation is also discussed. A simulation study confirms that the new procedure provides valid variance estimation. △ Less

Submitted 12 January, 2018; v1 submitted 29 March, 2017; originally announced March 2017.

Comments: 20 pages, 0 figure, 1 table

arXiv:1702.03453 [pdf, other]

An approximate Bayesian inference on propensity score estimation under unit nonresponse

Authors: Hejian Sang, Jae Kwang Kim

Abstract: Nonresponse weighting adjustment using the response propensity score is a popular tool for handling unit nonresponse. Statistical inference after the nonresponse weighting adjustment is complicated because the effect of estimating the propensity model parameter needs to be incorporated. In this paper, we propose an approximate Bayesian approach to handle unit nonresponse with parametric model assu… ▽ More Nonresponse weighting adjustment using the response propensity score is a popular tool for handling unit nonresponse. Statistical inference after the nonresponse weighting adjustment is complicated because the effect of estimating the propensity model parameter needs to be incorporated. In this paper, we propose an approximate Bayesian approach to handle unit nonresponse with parametric model assumptions on the response probability, but without model assumptions for the outcome variable. The proposed Bayesian method is calibrated to the frequentist inference in that the credible region obtained from the posterior distribution asymptotically matches to the frequentist confidence interval obtained from the Taylor linearization method. Unlike the frequentist approach, however, the proposed method does not involve Taylor linearization. The proposed method can be extended to handle over-identified cases in which there are more estimating equations than the parameters. Besides, the proposed method can also be modified to handle nonignorable nonresponse. Results from two simulation studies confirm the validity of the proposed methods, which are then applied to data from a Korean longitudinal survey. △ Less

Submitted 11 February, 2017; originally announced February 2017.

Comments: 38 pages

arXiv:1612.09207 [pdf, other]

Semiparametric Optimal Estimation With Nonignorable Nonresponse Data

Authors: Kosuke Morikawa, Jae Kwang Kim

Abstract: When the response mechanism is believed to be not missing at random (NMAR), a valid analysis requires stronger assumptions on the response mechanism than standard statistical methods would otherwise require. Semiparametric estimators have been developed under the model assumptions on the response mechanism. In this paper, a new statistical test is proposed to guarantee model identifiability withou… ▽ More When the response mechanism is believed to be not missing at random (NMAR), a valid analysis requires stronger assumptions on the response mechanism than standard statistical methods would otherwise require. Semiparametric estimators have been developed under the model assumptions on the response mechanism. In this paper, a new statistical test is proposed to guarantee model identifiability without using any instrumental variable. Furthermore, we develop optimal semiparametric estimation for parameters such as the population mean. Specifically, we propose two semiparametric optimal estimators that do not require any model assumptions other than the response mechanism. Asymptotic properties of the proposed estimators are discussed. An extensive simulation study is presented to compare with some existing methods. We present an application of our method using Korean Labor and Income Panel Survey data. △ Less

Submitted 7 May, 2020; v1 submitted 29 December, 2016; originally announced December 2016.

MSC Class: 62F35; 62G20; 62G10

arXiv:1508.06977 [pdf, ps, other]

doi 10.1093/biomet/asv073

A note on multiple imputation for method of moments estimation

Authors: Shu Yang, Jae Kwang Kim

Abstract: Multiple imputation is a popular imputation method for general purpose estimation. Rubin(1987) provided an easily applicable formula for the variance estimation of multiple imputation. However, the validity of the multiple imputation inference requires the congeniality condition of Meng(1994), which is not necessarily satisfied for method of moments estimation. This paper presents the asymptotic b… ▽ More Multiple imputation is a popular imputation method for general purpose estimation. Rubin(1987) provided an easily applicable formula for the variance estimation of multiple imputation. However, the validity of the multiple imputation inference requires the congeniality condition of Meng(1994), which is not necessarily satisfied for method of moments estimation. This paper presents the asymptotic bias of Rubin's variance estimator when the method of moments estimator is used as a complete-sample estimator in the multiple imputation procedure. A new variance estimator based on over-imputation is proposed to provide asymptotically valid inference for method of moments estimation. △ Less

Submitted 27 August, 2015; originally announced August 2015.

Comments: 8 pages, 0 figure

Journal ref: Biometrika (2016)

arXiv:1508.06945 [pdf, other]

doi 10.1214/16-STS569

Fractional Imputation in Survey Sampling: A Comparative Review

Authors: Shu Yang, Jae Kwang Kim

Abstract: Fractional imputation (FI) is a relatively new method of imputation for handling item nonresponse in survey sampling. In FI, several imputed values with their fractional weights are created for each missing item. Each fractional weight represents the conditional probability of the imputed value given the observed data, and the parameters in the conditional probabilities are often computed by an it… ▽ More Fractional imputation (FI) is a relatively new method of imputation for handling item nonresponse in survey sampling. In FI, several imputed values with their fractional weights are created for each missing item. Each fractional weight represents the conditional probability of the imputed value given the observed data, and the parameters in the conditional probabilities are often computed by an iterative method such as EM algorithm. The underlying model for FI can be fully parametric, semiparametric, or nonparametric, depending on plausibility of assumptions and the data structure. In this paper, we give an overview of FI, introduce key ideas and methods to readers who are new to the FI literature, and highlight some new development. We also provide guidance on practical implementation of FI and valid inferential tools after imputation. We demonstrate the empirical performance of FI with respect to multiple imputation using a pseudo finite population generated from a sample in Monthly Retail Trade Survey in US Census Bureau. △ Less

Submitted 27 August, 2015; originally announced August 2015.

Comments: 26 pages, 2 figures

Journal ref: Statistical Science (2016)

arXiv:1411.2305 [pdf, other]

Model-Parallel Inference for Big Topic Models

Authors: Xun Zheng, Jin Kyu Kim, Qirong Ho, Eric P. Xing

Abstract: In real world industrial applications of topic modeling, the ability to capture gigantic conceptual space by learning an ultra-high dimensional topical representation, i.e., the so-called "big model", is becoming the next desideratum after enthusiasms on "big data", especially for fine-grained downstream tasks such as online advertising, where good performances are usually achieved by regression-b… ▽ More In real world industrial applications of topic modeling, the ability to capture gigantic conceptual space by learning an ultra-high dimensional topical representation, i.e., the so-called "big model", is becoming the next desideratum after enthusiasms on "big data", especially for fine-grained downstream tasks such as online advertising, where good performances are usually achieved by regression-based predictors built on millions if not billions of input features. The conventional data-parallel approach for training gigantic topic models turns out to be rather inefficient in utilizing the power of parallelism, due to the heavy dependency on a centralized image of "model". Big model size also poses another challenge on the storage, where available model size is bounded by the smallest RAM of nodes. To address these issues, we explore another type of parallelism, namely model-parallelism, which enables training of disjoint blocks of a big topic model in parallel. By integrating data-parallelism with model-parallelism, we show that dependencies between distributed elements can be handled seamlessly, achieving not only faster convergence but also an ability to tackle significantly bigger model size. We describe an architecture for model-parallel inference of LDA, and present a variant of collapsed Gibbs sampling algorithm tailored for it. Experimental results demonstrate the ability of this system to handle topic modeling with unprecedented amount of 200 billion model variables only on a low-end cluster with very limited computational resources and bandwidth. △ Less

Submitted 9 November, 2014; originally announced November 2014.

arXiv:1406.4580 [pdf, other]

Primitives for Dynamic Big Model Parallelism

Authors: Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A. Gibson, Eric P. Xing

Abstract: When training large machine learning models with many variables or parameters, a single machine is often inadequate since the model may be too large to fit in memory, while training can take a long time even with stochastic updates. A natural recourse is to turn to distributed cluster computing, in order to harness additional memory and processors. However, naive, unstructured parallelization of M… ▽ More When training large machine learning models with many variables or parameters, a single machine is often inadequate since the model may be too large to fit in memory, while training can take a long time even with stochastic updates. A natural recourse is to turn to distributed cluster computing, in order to harness additional memory and processors. However, naive, unstructured parallelization of ML algorithms can make inefficient use of distributed memory, while failing to obtain proportional convergence speedups - or can even result in divergence. We develop a framework of primitives for dynamic model-parallelism, STRADS, in order to explore partitioning and update scheduling of model variables in distributed ML algorithms - thus improving their memory efficiency while presenting new opportunities to speed up convergence without compromising inference correctness. We demonstrate the efficacy of model-parallel algorithms implemented in STRADS versus popular implementations for Topic Modeling, Matrix Factorization and Lasso. △ Less

Submitted 17 June, 2014; originally announced June 2014.

arXiv:1312.7651 [pdf, other]

Petuum: A New Platform for Distributed Machine Learning on Big Data

Authors: Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

Abstract: What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized gr… ▽ More What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters. △ Less

Submitted 14 May, 2015; v1 submitted 30 December, 2013; originally announced December 2013.

Comments: 15 pages, 10 figures, final version in KDD 2015 under the same title

arXiv:1312.5766 [pdf, other]

Structure-Aware Dynamic Scheduler for Parallel Machine Learning

Authors: Seunghak Lee, Jin Kyu Kim, Qirong Ho, Garth A. Gibson, Eric P. Xing

Abstract: Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures even with stochastic updates. A natural solution is to turn to distributed computing on a cluster; however, naive, unstructured parallelization of ML algorithms does not usually lead to a proportional speedup and can even result in divergence, because dependencies… ▽ More Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures even with stochastic updates. A natural solution is to turn to distributed computing on a cluster; however, naive, unstructured parallelization of ML algorithms does not usually lead to a proportional speedup and can even result in divergence, because dependencies between model elements can attenuate the computational gains from parallelization and compromise correctness of inference. Recent efforts toward this issue have benefited from exploiting the static, a priori block structures residing in ML algorithms. In this paper, we take this path further by exploring the dynamic block structures and workloads therein present during ML program execution, which offers new opportunities for improving convergence, correctness, and load balancing in distributed ML. We propose and showcase a general-purpose scheduler, STRADS, for coordinating distributed updates in ML algorithms, which harnesses the aforementioned opportunities in a systematic way. We provide theoretical guarantees for our scheduler, and demonstrate its efficacy versus static block structures on Lasso and Matrix Factorization. △ Less

Submitted 30 December, 2013; v1 submitted 19 December, 2013; originally announced December 2013.

arXiv:1108.1074 [pdf, ps, other]

doi 10.1214/10-AOAS419

Variance estimation for nearest neighbor imputation for US Census long form data

Authors: Jae Kwang Kim, Wayne A. Fuller, William R. Bell

Abstract: Variance estimation for estimators of state, county, and school district quantities derived from the Census 2000 long form are discussed. The variance estimator must account for (1) uncertainty due to imputation, and (2) raking to census population controls. An imputation procedure that imputes more than one value for each missing item using donors that are neighbors is described and the procedure… ▽ More Variance estimation for estimators of state, county, and school district quantities derived from the Census 2000 long form are discussed. The variance estimator must account for (1) uncertainty due to imputation, and (2) raking to census population controls. An imputation procedure that imputes more than one value for each missing item using donors that are neighbors is described and the procedure using two nearest neighbors is applied to the Census long form. The Kim and Fuller [Biometrika 91 (2004) 559--578] method for variance estimation under fractional hot deck imputation is adapted for application to the long form data. Numerical results from the 2000 long form data are presented. △ Less

Submitted 4 August, 2011; originally announced August 2011.

Comments: Published in at http://dx.doi.org/10.1214/10-AOAS419 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS419

Journal ref: Annals of Applied Statistics 2011, Vol. 5, No. 2A, 824-842

Showing 1–40 of 40 results for author: Kim, J K