

Imputation strategies for missing binary outcomes in cluster randomized trials

Background

Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate for imputing missing data from CRTs since they assume independent data. In this paper, under the assumptions of missing completely at random and covariate dependent missingness, we used a simulation study to compare six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with standard imputation strategies and the complete case analysis approach. We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies. The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the complete CHAT dataset are 1.14 (0.76, 1.70). When 30% of the binary outcome is missing completely at random, the simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if the within-cluster MCMC method is used, 1.21 (0.80, 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression, which does not account for clustering, is used.
Conclusion

When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem more appropriate for handling missing outcomes from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

1. Introduction

Cluster randomized trials (CRTs), where groups of participants rather than individuals are randomized, are increasingly being used in health promotion and health services research [1]. When participants have to be managed within the same setting, such as a hospital, community, or family physician practice, this randomization strategy is usually adopted to minimize potential treatment contamination between intervention and control participants. It is also used when individual-level randomization may be inappropriate, unethical, or infeasible [2]. The main consequence of the cluster-randomized design is that participants cannot be assumed independent, owing to the similarity of participants from the same cluster. This similarity is quantified by the intra-cluster correlation coefficient (ICC). Considering the two components of the variation in the outcome, between-cluster and intra-cluster variation, the ICC may be interpreted as the proportion of the overall variation in the outcome that can be explained by the between-cluster variation [3]. It may also be interpreted as the correlation between the outcomes for any two participants in the same cluster.
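The variance-decomposition interpretation of the ICC can be made concrete with the classic one-way ANOVA estimator. The sketch below is generic illustration in Python, not the CHAT analysis code (which used SAS); the function name and the equal-cluster-size simplification are assumptions.

```python
def anova_icc(clusters):
    """Estimate the ICC for a clustered binary outcome with the one-way
    ANOVA method: (MSB - MSW) / (MSB + (m - 1) * MSW).

    `clusters` is a list of lists of 0/1 outcomes, one inner list per
    cluster; equal cluster sizes are assumed for simplicity."""
    k = len(clusters)            # number of clusters
    m = len(clusters[0])         # common cluster size
    n = k * m                    # total number of observations
    grand_mean = sum(sum(c) for c in clusters) / n
    # Between-cluster and within-cluster sums of squares
    ssb = sum(m * (sum(c) / m - grand_mean) ** 2 for c in clusters)
    ssw = sum(sum((y - sum(c) / m) ** 2 for y in c) for c in clusters)
    msb = ssb / (k - 1)          # between-cluster mean square
    msw = ssw / (n - k)          # within-cluster mean square
    return (msb - msw) / (msb + (m - 1) * msw)
```

When every member of a cluster shares the same outcome, the within-cluster mean square is zero and the estimator returns 1; when cluster means are all equal, the estimate is at or below zero, consistent with the "proportion of variation explained by clusters" reading above.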
It has been well established that failing to account for the intra-cluster correlation in the analysis can increase the chance of obtaining statistically significant but spurious findings [4]. The risk of attrition may be very high in some CRTs due to the lack of direct contact with individual participants and lengthy follow-up [5]. In addition to missing individuals, entire clusters may be missing, which further complicates the handling of missing data in CRTs. The impact of missing data on the results of statistical analysis depends on the mechanism which caused the data to be missing and the way it is handled. The default approach to this problem is complete case analysis (also called listwise deletion), i.e., excluding the participants with missing data from the analysis. Though this approach is easy to use and is the default option in most statistical packages, it may substantially weaken the statistical power of the trial and may also lead to biased results, depending on the missing data mechanism. Generally, the nature or type of missingness falls into four categories: missing completely at random (MCAR), missing at random (MAR), covariate dependent (CD) missing, and missing not at random (MNAR) [6]. Understanding these categories is important since the appropriate solution varies with the nature of the missingness. MCAR means that the missing data mechanism, i.e., the probability of missingness, does not depend on the observed or unobserved data. Both the MAR and CD mechanisms indicate that the causes of missing data are unrelated to the missing values themselves, but may be related to the observed values.
In the context of longitudinal data, when serial measurements are taken for each individual, MAR means that the probability of a missing response at a particular visit is related to either observed responses at previous visits or covariates, whereas CD missingness, a special case of MAR, means that the probability of a missing response depends only upon covariates. MNAR means that the probability of missing data depends on the unobserved data. It commonly occurs when people drop out of the study due to poor or good health outcomes. A key distinction between these categories is that MNAR is non-ignorable while the other three (i.e., MCAR, CD, and MAR) are ignorable [7]. Under ignorable missingness, imputation strategies such as mean imputation, hot deck, last observation carried forward, or multiple imputation (MI), which substitute one or more plausible values for each missing value, can produce a complete dataset that is not adversely biased [8, 9]. Non-ignorable missing data are more challenging and require a different approach [10]. The two main approaches to handling missing outcomes are likelihood-based analyses and imputation [10]. In this paper, we focus on MI strategies, which take into account the variability or uncertainty of the missing data, to impute the missing binary outcome in CRTs. Under the assumption of MAR, MI strategies replace each missing value with a set of plausible values to create multiple imputed datasets, usually varying in number from 3 to 10 [11]. These multiple imputed datasets are analyzed using standard procedures for complete data. Results from the imputed datasets are then combined to generate the final inference. Standard MI procedures are available in many statistical software packages such as SAS (Cary, NC), SPSS (Chicago, IL), and STATA (College Station, TX).
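The combination step is conventionally done with Rubin's rules. The following is a minimal generic sketch in Python (the paper's own workflow uses SAS PROC MIANALYZE); the function name is illustrative.

```python
from statistics import mean

def pool_mi(estimates, variances):
    """Pool m per-dataset point estimates and their variances with
    Rubin's rules. Returns the pooled estimate and its total variance
    T = U + (1 + 1/m) * B, where U is the mean within-imputation
    variance and B the between-imputation variance of the estimates."""
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    u_bar = mean(variances)                      # within-imputation variance U
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between, B
    t = u_bar + (1 + 1 / m) * b                  # total variance T
    return q_bar, t
```

For example, five imputed-dataset estimates 1.0, 1.2, 1.1, 0.9, 1.3, each with within-imputation variance 0.04, pool to an estimate of 1.1 with total variance 0.04 + 1.2 × 0.025 = 0.07.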
However, these procedures assume that observations are independent and may not be suitable for CRTs since they do not take into account the intra-cluster correlation. To the best of our knowledge, limited investigation has been done on imputation strategies for missing binary or categorical outcomes in CRTs. Yi and Cook reported marginal methods for missing longitudinal data from clustered designs [12]. Hunsberger et al. [13] described three strategies for missing continuous data in CRTs: 1) a multiple imputation procedure in which the missing values are replaced with re-sampled values from the observed data; 2) a median procedure based on the Wilcoxon rank sum test, assigning the missing data in the intervention group the worst ranks; and 3) a multiple imputation procedure in which the missing values are replaced by the predicted values from a regression equation. Nixon et al. [14] presented strategies for imputing missing endpoints from a surrogate. In the analysis of a continuous outcome from the Community Intervention Trial for Smoking Cessation (COMMIT), Green et al. stratified individual participants into groups that were more homogeneous with respect to the predicted outcome; within each stratum, they imputed the missing outcome using the observed data [15, 16]. Taljaard et al. [17] compared several imputation strategies for missing continuous outcomes in CRTs under the assumption of missing completely at random. These strategies include cluster mean imputation, within-cluster MI using the Approximate Bayesian Bootstrap (ABB) method, pooled MI using the ABB method, standard regression MI, and mixed-effects regression MI. As pointed out by Kenward et al., if a substantive model that reflects the data structure, such as a generalized linear mixed model, is to be used, it is important that the imputation model also reflect this structure [18].
The objectives of this paper are to: i) investigate the performance of various imputation strategies for missing binary outcomes in CRTs under different percentages of missingness, assuming a mechanism of missing completely at random or covariate dependent missingness; ii) compare the agreement between the complete dataset and the imputed datasets obtained from different imputation strategies; and iii) compare the robustness of the results under two commonly used statistical analysis methods, the generalized estimating equations (GEE) and random-effects (RE) logistic regression, under different imputation strategies.

2. Methods

In this paper, we consider three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the MCMC method, which are standard MI strategies conducted within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects logistic regression method, and logistic regression with cluster as a fixed effect. Based on the complete dataset from the community hypertension assessment trial (CHAT), we conducted a simulation study to investigate the performance of the above MI strategies. We used the Kappa statistic to compare the agreement between the imputed datasets and the complete dataset. We also used the estimated treatment effects obtained from the GEE and RE logistic regression models [19] to assess the robustness of the results under different percentages of missing binary outcome under the assumptions of MCAR and CD missingness.

2.1. Complete case analysis

Using this approach, only the patients with complete data are included in the analysis, while patients with missing data are excluded.
When the data are MCAR, the complete case analysis approach, using either a likelihood-based analysis such as RE logistic regression or a marginal model such as the GEE approach, is valid for analyzing a binary outcome from CRTs, since the missing data mechanism is independent of the outcome. When the data are CD missing, both the RE logistic regression and GEE approaches are valid if the known covariates associated with the missing data mechanism are adjusted for. This can be implemented using the GENMOD and NLMIXED procedures in SAS.

2.2. Standard multiple imputation

Assuming the observations are independent, we can apply the standard MI procedures provided by any standard statistical software such as SAS. Three widely used MI methods are the predictive model method (the logistic regression method for binary data), the propensity score method, and the MCMC method [20]. In general, both the propensity score method and the MCMC method are recommended for the imputation of continuous variables [21]. A dataset is said to have a monotone missing pattern when, whenever a measurement Y_j is missing for an individual, all subsequent measurements Y_k, k > j, are also missing for that individual. When the data follow a monotone missing pattern, any of the parametric predictive model, the nonparametric method that uses propensity scores, or the MCMC method is appropriate [21]. For arbitrary missing data patterns, an MCMC method that assumes multivariate normality can be used [10]. These MI strategies are implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS, separately for each intervention group.

2.2.1. Logistic regression method

In this approach, a logistic regression model is fitted using the observed outcome and covariates [21]. Based on the parameter estimates and the associated covariance matrix, the posterior predictive distribution of the parameters can be constructed.
A new logistic regression model is then simulated from the posterior predictive distribution of the parameters and is used to impute the missing values.

2.2.2. Propensity score method

The propensity score is the conditional probability of being missing given the observed data. It can be estimated by means of a logistic regression model with a binary outcome indicating whether the data are missing or not. The observations are then stratified into a number of strata based on these propensity scores. The ABB procedure [22] is then applied within each stratum. ABB imputation first draws with replacement from the observed data to create a new dataset, which is a nonparametric analogue of drawing parameters from the posterior predictive distribution of the parameters, and then randomly draws imputed values with replacement from the new dataset.

2.2.3. Markov chain Monte Carlo method

With the MCMC method, pseudo-random samples are drawn from a target probability distribution [21]. The target distribution is the joint conditional distribution of Y_mis and θ given Y_obs when the missing data have a non-monotone pattern, where Y_mis and Y_obs represent the missing and observed data, respectively, and θ represents the unknown parameters. The MCMC method proceeds as follows: replace Y_mis by some assumed values, then simulate θ from the resulting complete-data posterior distribution P(θ | Y_obs, Y_mis). Let θ^(t) be the current simulated value of θ; then Y_mis^(t+1) can be drawn from the conditional predictive distribution Y_mis^(t+1) ~ P(Y_mis | Y_obs, θ^(t)). Conditioning on Y_mis^(t+1), the next simulated value of θ can be drawn from its complete-data posterior distribution θ^(t+1) ~ P(θ | Y_obs, Y_mis^(t+1)). By repeating the above procedure, we can generate a Markov chain which converges in distribution to P(Y_mis, θ | Y_obs). This method is attractive since it avoids complicated analytic calculation of the posterior distribution of θ and Y_mis.
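The two-step iteration can be sketched with a toy example. Here a single Bernoulli parameter θ with a Beta(1, 1) prior stands in for the multivariate normal model the SAS implementation uses; the function name, the simplified model, and the default arguments are illustrative assumptions, not the authors' procedure.

```python
import random

def data_augmentation(y_obs, n_mis, iters=2000, seed=1):
    """Toy data-augmentation sampler for a Bernoulli(theta) outcome with
    n_mis missing values, mirroring the two-step scheme:
      I-step: draw Y_mis from P(Y_mis | Y_obs, theta^(t));
      P-step: draw theta from the complete-data posterior
              P(theta | Y_obs, Y_mis), Beta-Bernoulli conjugacy."""
    rng = random.Random(seed)
    y_mis = [0] * n_mis                 # arbitrary starting values for Y_mis
    draws = []
    for _ in range(iters):
        # P-step: theta | Y ~ Beta(1 + successes, 1 + failures)
        s = sum(y_obs) + sum(y_mis)
        n = len(y_obs) + n_mis
        theta = rng.betavariate(1 + s, 1 + n - s)
        # I-step: each missing value ~ Bernoulli(theta)
        y_mis = [1 if rng.random() < theta else 0 for _ in range(n_mis)]
        draws.append(theta)
    return draws
```

After a burn-in, the retained draws of θ (and the accompanying Y_mis draws) approximate the joint posterior, which is what the imputation step samples from.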
However, convergence of the chain is an issue that researchers need to face. In addition, this method is based on the assumption of multivariate normality. When it is used to impute binary variables, the imputed values can be any real value; most fall between 0 and 1, but some fall outside this range. We round each imputed value to 0 if it is less than 0.5 and to 1 otherwise. This multiple imputation method is implemented using the MI procedure in SAS. We use a single chain and a non-informative prior for all imputations, and the expectation-maximization (EM) algorithm to find maximum likelihood estimates in parametric models for incomplete data and to derive parameter estimates from a posterior mode. The iterations are considered to have converged when the change in the parameter estimates between iteration steps is less than 0.0001 for each parameter.

2.3. Within-cluster multiple imputation

Standard MI strategies are inappropriate for handling missing data from CRTs because of their assumption of independent observations. For within-cluster imputation, we carry out the standard MI strategies described above, using the logistic regression method, the propensity score method, and the MCMC method, separately for each cluster. Thus, the missing values are imputed based on the observed data from the same cluster. Given that subjects within the same cluster are more likely to be similar to each other than subjects from different clusters, within-cluster imputation can be seen as a strategy that accounts for the intra-cluster correlation when imputing missing values. These MI strategies are implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4. Across-cluster multiple imputation

2.4.1. Propensity score method

Compared to standard multiple imputation using the propensity score method, we add cluster as one of the covariates used to obtain the propensity score for each observation.
Consequently, patients within the same cluster are more likely to be categorized into the same propensity score stratum, so the intra-cluster correlation is taken into account when the ABB procedure is applied within each stratum to generate the imputed values for the missing data. This multiple imputation strategy is implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4.2. Random-effects logistic regression

Compared to the predictive model using the standard logistic regression method, we assume the binary outcome follows the random-effects logistic model:

logit(Pr(Y_ijl = 1)) = β X_ijl + U_ij,

where Y_ijl is the binary outcome of patient l in cluster j in intervention group i; X_ijl is the matrix of fully observed individual-level or cluster-level covariates; U_ij ~ N(0, σ_B²) represents the cluster-level random effect; and σ_B² represents the between-cluster variance. σ_B² can be estimated by fitting the random-effects logistic regression model to the observed outcome and covariates. The MI strategy using the random-effects logistic regression method obtains the imputed values in three steps: (1) fit a random-effects logistic regression model as described above using the observed outcome and covariates; (2) based on the estimates of β and σ_B obtained from step (1) and the associated covariance matrix, construct the posterior predictive distribution of these parameters; (3) fit a new random-effects logistic regression model using the parameters simulated from the posterior predictive distribution and the observed covariates to impute the missing outcome. The MI strategy using random-effects logistic regression takes into account the between-cluster variance, which is ignored by the MI strategy using standard logistic regression, and therefore may be valid for imputing missing binary data in CRTs. We provide the SAS code for this method in Appendix A.

2.4.3. Logistic regression with cluster as a fixed effect

Compared to the predictive model using the standard logistic regression method, we add cluster as a fixed effect to account for the clustering effect. This multiple imputation strategy is implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

3. Simulation study

3.1. Community hypertension assessment trial

The CHAT study is reported in detail elsewhere [23]. In brief, it was a cluster randomized controlled trial evaluating the effectiveness of pharmacy-based blood pressure (BP) clinics led by peer health educators, with feedback to family physicians (FPs), on the management and monitoring of BP among patients 65 years or older. The FP was the unit of randomization, and patients from the same FP received the same intervention. In total, 28 FPs participated in the study: fourteen were randomly allocated to the intervention (pharmacy BP clinics) and 14 to the control group (no BP clinics offered). Fifty-five patients were randomly selected from each FP's roster, so 1540 patients participated in the study. All eligible patients in both the intervention and control groups received the usual health service at their FP's office. Patients in the practices allocated to the intervention group were invited to visit the community BP clinics, where peer health educators assisted patients in measuring their BP and reviewing their cardiovascular risk factors. Research nurses conducted baseline and end-of-trial (12 months after randomization) audits of the health records of the 1540 participating patients. The primary outcome of the CHAT study was a binary variable indicating whether or not the patient's BP was controlled at the end of the trial. A patient's BP was considered controlled if, at the end of the trial, the systolic BP was below 140 mmHg and the diastolic BP below 90 mmHg for a patient without diabetes or target organ damage, or the systolic BP was below 130 mmHg and the diastolic BP below 80 mmHg for a patient with diabetes or target organ damage.
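The primary outcome definition amounts to a simple threshold rule. The sketch below assumes the comparison operators lost in extraction were "below target" inequalities; the function name and signature are illustrative, not from the trial code.

```python
def bp_controlled(systolic, diastolic, diabetes_or_tod):
    """CHAT-style binary outcome: was the patient's BP controlled?

    Thresholds follow the definition in the text: below 140/90 mmHg for
    patients without diabetes or target organ damage, below 130/80 mmHg
    for patients with either condition. (Strict inequality is an
    assumption; the operators were stripped from the extracted text.)"""
    if diabetes_or_tod:
        return systolic < 130 and diastolic < 80
    return systolic < 140 and diastolic < 90
```

For instance, a reading of 135/85 counts as controlled for a patient without diabetes or target organ damage but as uncontrolled for a patient with diabetes, since the stricter 130/80 target applies.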
Besides the intervention group, the other predictors considered in this paper were age (continuous), sex (binary), diabetes at baseline (binary), heart disease at baseline (binary), and whether the patient's BP was controlled at baseline (binary). At the end of the trial, 55% of patients had controlled BP. Without including any other predictors in the model, the treatment effects and their 95% confidence intervals (CIs) estimated from the GEE and RE models were 1.14 (0.72, 1.80) and 1.10 (0.65, 1.86), respectively, and the estimated ICC was 0.077. After adjustment for the above-mentioned variables, the treatment effects and their CIs estimated from the GEE and RE models were 1.14 (0.76, 1.70) and 1.12 (0.72, 1.76), respectively, and the estimated ICC was 0.055. Since there are no missing data in the CHAT dataset, it provides a convenient platform for designing a simulation study to compare the imputed and observed values and to further investigate the performance of the different multiple imputation strategies under different missing data mechanisms and percentages of missingness.

3.2. Generating datasets with missing binary outcome

Using the CHAT study dataset, we investigated the performance of different MI strategies for a missing binary outcome under the MCAR and CD mechanisms. Under the assumption of MCAR, we generated datasets with a given percentage of missingness in the binary outcome, which indicates whether or not each patient's BP was controlled at the end of the trial. The probability of missingness for each patient was completely random, i.e., it did not depend on any observed or unobserved CHAT data. Under the assumption of CD missingness, we assumed that sex, treatment group, and whether the patient's BP was controlled at baseline, covariates commonly associated with dropout in clinical trials and observational studies [24-26], were associated with the probability of missingness.
We further assumed that male patients were 1.2 times more likely to have a missing outcome; that patients allocated to the control group were 1.3 times more likely to have a missing outcome; and that patients whose BP was not controlled at baseline were 1.4 times more likely to have a missing outcome than patients whose BP was controlled at baseline.

3.3. Design of simulation study

First, we compared the agreement between the imputed values of the outcome variable and its true values using the Kappa statistic. The Kappa statistic is the most commonly used statistic for assessing agreement between two observers or methods, taking into account the fact that they will sometimes agree or disagree simply by chance [27]. It is calculated from the difference between how much agreement is actually present and how much agreement would be expected by chance alone. A Kappa of 1 indicates perfect agreement, and 0 indicates agreement equivalent to chance. The Kappa statistic has been widely used by researchers to evaluate the performance of different imputation techniques for missing categorical data [28, 29]. Second, under MCAR and CD missingness, we compared the treatment effect estimates from the RE and GEE methods under the following scenarios: 1) exclude the missing values from the analysis, i.e., complete case analysis; 2) apply standard multiple imputation strategies, which do not take the intra-cluster correlation into account; 3) apply the within-cluster imputation strategies; and 4) apply the across-cluster imputation strategies. We designed the simulation study according to the following steps. 1) Generate 5%, 10%, 15%, 20%, 30%, and 50% missing outcomes under both the MCAR and CD missingness assumptions; these amounts were chosen to cover the range of missingness plausible in practice [30]. 2) Apply the above multiple imputation strategies to generate m = 5 imputed datasets.
According to Rubin, the relative efficiency of MI does not increase much when more than 5 imputed datasets are generated [11]. 3) Calculate the Kappa statistic to assess the agreement between the imputed and true values of the outcome variable. 4) Obtain a single treatment effect estimate by combining the effect estimates from the 5 imputed datasets using the GEE and RE models. We then repeated the above four steps 1000 times, i.e., 1000 simulation runs, calculated the overall Kappa statistic by averaging the Kappa statistics from the 1000 simulation runs, and calculated the overall treatment effect and its standard error by averaging the treatment effects and their standard errors from the 1000 simulation runs.

4. Results

4.1. Results when data are missing completely at random

With 5%, 10%, 15%, 20%, 30%, or 50% missingness under the MCAR assumption, the estimated Kappa for all imputation strategies is slightly over 0.95, 0.90, 0.85, 0.80, 0.70, and 0.50, respectively. The estimated Kappas for the different imputation strategies at different percentages of missing outcomes under the MCAR assumption are presented in detail in Table 1.

Table 1. Kappa statistics for different imputation strategies when missingness is completely at random.

Table. Treatment effect estimated from random-effects logistic regression when 30% of data are covariate dependent missing.

5. Discussion

In this paper, under the assumptions of MCAR and CD missingness, we used a simulation study to compare six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with standard imputation strategies and the complete case analysis approach. Our results show that, first, when the percentage of missing data is low or the intra-cluster correlation coefficient is small, the different imputation strategies and the complete case analysis approach generate quite similar results.
Second, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect; they may therefore lead to statistically significant but spurious conclusions when used to handle missing data from CRTs. Third, under the assumptions of MCAR and CD missingness, the point estimates (ORs) are quite similar across the different approaches to handling the missing data, except for the random-effects logistic regression MI strategy. Fourth, both within-cluster and across-cluster MI strategies take into account the intra-cluster correlation and provide more conservative treatment effect estimates than MI strategies which ignore the clustering effect. Fifth, within-cluster imputation strategies lead to wider CIs than across-cluster imputation strategies, especially when the percentage of missingness is high; this may be because within-cluster imputation strategies use only a fraction of the data, which leads to greater variation in the estimated treatment effect. Sixth, a larger estimated Kappa, which indicates higher agreement between the imputed and observed values, is associated with better performance of the MI strategies, in the sense of generating an estimated treatment effect and 95% CI closer to those obtained from the complete CHAT dataset. Seventh, under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar. To the best of our knowledge, limited work has been done on comparing different multiple imputation strategies for missing binary outcomes in CRTs. Taljaard et al. [17] compared four MI strategies (pooled ABB, within-cluster ABB, standard regression, and mixed-effects regression) for a missing continuous outcome in CRTs when data are missing completely at random. Their findings are similar to ours.
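The agreement measure used throughout the simulations can be computed directly for binary vectors. This is a minimal generic sketch, not the authors' implementation; the function name is an assumption.

```python
def cohen_kappa(truth, imputed):
    """Cohen's kappa between true and imputed binary (0/1) values:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the marginal proportions.
    Assumes both vectors contain a mix of classes so that p_e < 1."""
    n = len(truth)
    p_o = sum(t == i for t, i in zip(truth, imputed)) / n   # observed
    p_t1 = sum(truth) / n                                   # P(truth = 1)
    p_i1 = sum(imputed) / n                                 # P(imputed = 1)
    p_e = p_t1 * p_i1 + (1 - p_t1) * (1 - p_i1)             # chance
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement gives kappa = 1, chance-level agreement gives 0, and systematic disagreement gives a negative value, matching the interpretation given in section 3.3.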
It should be noted that within-cluster MI strategies may only be applicable when the cluster size is sufficiently large and the percentage of missingness is relatively small. In the CHAT study, there were 55 patients in each cluster, which provided enough data to carry out the within-cluster imputation strategies using the propensity score and MCMC methods. However, the logistic regression method failed when the percentage of missingness was high. This was because, when a large percentage (20% or more) of missing outcomes was generated, all patients with an outcome of 0 were simulated as missing in some clusters, so the logistic regression model failed for those particular clusters. In addition, our results show that the complete case analysis approach performs relatively well even with 50% missingness. We think that, due to the intra-cluster correlation, one would not expect the missing values to have much impact if a large proportion of a cluster is still present; however, further investigation of this issue using a simulation study would be helpful. Our results show that the across-cluster random-effects logistic regression strategy leads to a potentially biased estimate, especially when the percentage of missingness is high. As described in section 2.4.2, we assume the cluster-level random effects follow a normal distribution, i.e., U_ij ~ N(0, σ_B²). Researchers have shown that misspecification of the distributional shape has little impact on inferences about the fixed effects [31]. Incorrectly assuming that the random effects distribution is independent of the cluster size may affect inferences about the intercept, but does not seriously affect inferences about the regression parameters. However, incorrectly assuming that the random effects distribution is independent of covariates may seriously affect inferences about the regression parameters [32, 33].
For our dataset, the mean of the random effects distribution could be associated with a covariate, or the variance of the random effects distribution could be associated with a covariate, which might explain the potential bias of the across-cluster random-effects logistic regression strategy. In contrast, the imputation strategy using logistic regression with cluster as a fixed effect performs better; however, it can only be applied when the cluster size is large enough to provide a stable estimate of the cluster effect. For multiple imputation, the overall variance of the estimated treatment effect consists of two parts: the within-imputation variance U and the between-imputation variance B. The total variance T is calculated as T = U + (1 + 1/m)B, where m is the number of imputed datasets [10]. Since standard MI strategies ignore the between-cluster variance and fail to account for the intra-cluster correlation, the within-imputation variance may be underestimated, which could lead to underestimation of the total variance and consequently a narrower confidence interval. In addition, the adequacy of standard MI strategies depends on the ICC; in our study, the ICC of the CHAT dataset is 0.055 and the cluster effect in the random-effects model is statistically significant. Among the three imputation methods, the predictive model (logistic regression method), the propensity score method, and the MCMC method, the last is the most popular method for multiple imputation of missing data and is the default method implemented in SAS. Although this method is widely used to impute binary and polytomous data, there are concerns about the consequences of violating the normality assumption. Experience has repeatedly shown that multiple imputation using the MCMC method tends to be quite robust even when the real data depart from the multivariate normal distribution [20].
Therefore, when handling missing binary or ordered categorical variables, it is acceptable to impute under a normality assumption and then round the continuous imputed values to the nearest category. For example, the imputed values for a missing binary variable can be any real value rather than being restricted to 0 and 1. We rounded the imputed values so that values greater than or equal to 0.5 were set to 1, and values less than 0.5 were set to 0 [34]. Horton et al. [35] showed that such rounding may produce biased estimates of proportions when the true proportion is near 0 or 1, but does well under most other conditions. The propensity score method was originally designed to impute missing values on the response variables from randomized experiments with repeated measures [21]. Since it uses only the covariate information associated with the missingness and ignores the correlations among variables, it may produce badly biased estimates of regression coefficients when data on predictor variables are missing. In addition, with small sample sizes and a relatively large number of propensity score groups, application of the approximate Bayesian bootstrap (ABB) method is problematic, especially for binary variables; in this case, a modified version of ABB should be used [36]. There are some limitations of the present study that need to be acknowledged and addressed. First, the simulation study is based on a real dataset, which has a relatively large cluster size and a small ICC. Further research should investigate the performance of different imputation strategies under different design settings. Second, the scenario of missing an entire cluster is not investigated in this paper; the proposed within-cluster and across-cluster MI strategies may not apply to this scenario. Third, we investigate the performance of different MI strategies assuming the missing data mechanisms of MCAR and CD missing. Therefore, the results cannot be generalized to MAR or MNAR scenarios.
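The rounding rule above, and the bias Horton et al. warn about near extreme proportions, can be sketched in a few lines. The normal spread of the imputed values (0.14) and the true proportion (0.02) below are made-up illustrative numbers, not quantities from the paper:

```python
import random

def round_binary(imputed_values):
    """Round continuous imputations to {0, 1} at the 0.5 threshold."""
    return [1 if v >= 0.5 else 0 for v in imputed_values]

# Toy illustration of the concern: imputations drawn under a normality
# assumption centered at a true proportion near 0, then rounded at 0.5.
rng = random.Random(42)
true_p = 0.02
draws = [rng.gauss(true_p, 0.14) for _ in range(100_000)]
rounded = round_binary(draws)
prop = sum(rounded) / len(rounded)
# prop is far below true_p here: near the boundary, the rounded
# proportion is a biased estimate of the true proportion.
```

Away from the boundaries (true proportion well inside (0, 1)), the same rounding rule behaves much better, which matches the "does well under most other conditions" finding cited above.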
Fourth, since the estimated treatment effects are similar under different imputation strategies, we only presented the OR and 95% CI for each simulation scenario. However, estimates of standardized bias and coverage would be more informative and would also provide a quantitative guideline for assessing the adequacy of the imputations [37].

6. Conclusions

When the percentage of missing data is low or the intra-cluster correlation coefficient is small, the different imputation strategies and the complete case analysis approach generate quite similar results. When the percentage of missing data is high, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem more appropriate for handling missing outcomes from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

Appendix A: SAS code for across-cluster random-effects logistic regression method

%let maximum = 1000;
ods listing close;
proc nlmixed data=mcar&percent&index cov;
  parms b0=-0.0645 bgroup=-0.1433 bdiabbase=-0.04 bhdbase=0.1224
        bage=-0.0066 bbasebpcontrolled=1.1487 bsex=0.0873 s2u=0.5;

Population Health Research Institute, Hamilton Health Sciences

References

Campbell MK, Grimshaw JM: Cluster randomised trials: time for improvement. The implications of adopting a cluster design are still largely being ignored. BMJ. 1998, 317 (7167): 1171-1172.
COMMIT Research Group: Community Intervention Trial for Smoking Cessation (COMMIT): 1. Cohort results from a four-year community intervention. Am J Public Health. 1995, 85: 183-192. 10.2105/AJPH.85.2.183.
Donner A, Klar N: Design and Analysis of Cluster Randomisation Trials in Health Research. 2000, London: Arnold.
Cornfield J: Randomization by group: a formal analysis. Am J Epidemiol. 1978, 108 (2): 100-102.
Donner A, Brown KS, Brasher P: A methodological review of non-therapeutic intervention trials employing cluster randomization, 1979-1989. Int J Epidemiol. 1990, 19 (4): 795-800. 10.1093/ije/19.4.795.
Rubin DB: Inference and missing data. Biometrika. 1976, 63: 581-592. 10.1093/biomet/63.3.581.
Allison PD: Missing Data. 2001, SAGE Publications Inc.
Schafer JL, Olsen MK: Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research. 1998, 33: 545-571.
McArdle JJ: Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research. 1994, 29: 409-454.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. 2002, New York: John Wiley, Second Edition.
Rubin DB: Multiple Imputation for Nonresponse in Surveys. 1987, New York, NY: John Wiley & Sons, Inc.
Yi GYY, Cook RJ: Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association. 2002, 97 (460): 1071-1080. 10.1198/016214502388618889.
Hunsberger S, Murray D, Davis CE, Fabsitz RR: Imputation strategies for missing data in a school-based multi-centre study: the Pathways study. Stat Med. 2001, 20 (2): 305-316.
Nixon RM, Duffy SW, Fender GR: Imputation of a true endpoint from a surrogate: application to a cluster randomized controlled trial with partial information on the true endpoint. BMC Med Res Methodol. 2003, 3: 17. 10.1186/1471-2288-3-17.
Green SB, Corle DK, Gail MH, Mark SD, Pee D, Freedman LS, Graubard BI, Lynn WR: Interplay between design and analysis for behavioral intervention trials with community as the unit of randomization. Am J Epidemiol. 1995, 142 (6): 587-593.
Green SB: The advantages of community-randomized trials for evaluating lifestyle modification. Control Clin Trials. 1997, 18 (6): 506-513, discussion 514-516. 10.1016/S0197-2456(97)00013-5.
Taljaard M, Donner A, Klar N: Imputation strategies for missing continuous outcomes in cluster randomized trials. Biom J. 2008, 50 (3): 329-345. 10.1002/bimj.200710423.
Kenward MG, Carpenter J: Multiple imputation: current perspectives. Stat Methods Med Res. 2007, 16 (3): 199-218. 10.1177/0962280206075304.
Dobson AJ: An Introduction to Generalized Linear Models. 2002, Boca Raton: Chapman & Hall/CRC, 2nd Edition.
Schafer JL: Analysis of Incomplete Multivariate Data. 1997, London: Chapman and Hall.
SAS Publishing: SAS/STAT 9.1 User's Guide.
Rubin DB, Schenker N: Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association. 1986, 81 (394): 366-374. 10.2307/2289225.
Ma J, Thabane L, Kaczorowski J, Chambers L, Dolovich L, Karwalajtys T, Levitt C: Comparison of Bayesian and classical methods in the analysis of cluster randomized controlled trials with a binary outcome: the Community Hypertension Assessment Trial (CHAT). BMC Med Res Methodol. 2009, 9: 37. 10.1186/1471-2288-9-37.
Levin KA: Study design VII. Randomised controlled trials. Evid Based Dent. 2007, 8 (1): 22-23. 10.1038/sj.ebd.6400473.
Matthews FE, Chatfield M, Freeman C, McCracken C, Brayne C, MRC CFAS: Attrition and bias in the MRC cognitive function and ageing study: an epidemiological investigation. BMC Public Health. 2004, 4: 12. 10.1186/1471-2458-4-12.
Ostbye T, Steenhuis R, Wolfson C, Walton R, Hill G: Predictors of five-year mortality in older Canadians: the Canadian Study of Health and Aging. J Am Geriatr Soc. 1999, 47 (10): 1249-1254.
Viera AJ, Garrett JM: Understanding interobserver agreement: the kappa statistic. Fam Med. 2005, 37 (5): 360-363.
Laurenceau JP, Stanley SM, Olmos-Gallo A, Baucom B, Markman HJ: Community-based prevention of marital dysfunction: multilevel modeling of a randomized effectiveness study. J Consult Clin Psychol. 2004, 72 (6): 933-943. 10.1037/0022-006X.72.6.933.
Shrive FM, Stuart H, Quan H, Ghali WA: Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Med Res Methodol. 2006, 6: 57. 10.1186/1471-2288-6-57.
Elobeid MA, Padilla MA, McVie T, Thomas O, Brock DW, Musser B, Lu K, Coffey CS, Desmond RA, St-Onge MP, Gadde KM, Heymsfield SB, Allison DB: Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field, and performance of statistical methods.
PLoS One. 2009, 4 (8): e6624. 10.1371/journal.pone.0006624.
McCulloch CE, Neuhaus JM: Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics.
Neuhaus JM, McCulloch CE: Separating between- and within-cluster covariate effects using conditional and partitioning methods. Journal of the Royal Statistical Society, Series B. 2006, 68: 859-872.
Heagerty PJ, Kurland BF: Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika. 2001, 88 (4): 973-985. 10.1093/biomet/88.4.973.
Christopher FA: Rounding after multiple imputation with non-binary categorical covariates. SAS Focus Session SUGI. 2004, 30.
Horton NJ, Lipsitz SR, Parzen M: A potential for bias when rounding in multiple imputation. American Statistician. 2003, 57: 229-232. 10.1198/0003130032314.
Li X, Mehrotra DV, Barnard J: Analysis of incomplete longitudinal binary data using multiple imputation. Stat Med. 2006, 25 (12): 2107-2124. 10.1002/sim.2343.
Collins LM, Schafer JL, Kam CM: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods. 2001, 6 (4): 330-351. 10.1037/1082-989X.6.4.330.

Ma et al; licensee BioMed Central Ltd. 2011. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multiple Imputation

LIMDEP's new implementation of multiple imputation is woven into the entire program, not just a few specific models.
Any estimator, even your own created with MAXIMIZE, or any other computation involving data that produces a coefficient vector and a sampling covariance matrix, can be based on multiply imputed data sets. And we have built this technique to bypass the need to create multiple data sets: traditionally, the need to replicate the full data set has hobbled this method. LIMDEP's implementation of multiple imputation uses only the existing data set. The results are fully replicable as well. (You can create and save the imputed data sets if you wish.)

Multiple Imputation Features, Including Continuous Data, Binary Variables, Ordered Outcomes and More

Imputation equations for filling missing values; up to 30 variables imputed simultaneously. Six types of imputation procedures:
Continuous variables using multiple regression
Binary variables using logistic regression
Count variables using Poisson regression
Likert scale (ordered outcomes) using ordered probit
Fractional (proportional outcome) using logistic regression
Unordered multinomial choice using multinomial logit
No duplication of the base data set. Estimation step for any model in LIMDEP or NLOGIT: all models supported by built-in procedures, plus any model written by the user with GMME, MAXIMIZE, NLSQ, etc. Estimate any number of models using each imputed data set.

Here is a constructed example based on a data set that contains 27,326 observations and about 30 variables. The variable married is a marital status dummy variable. We have injected about 10% missing values into this binary variable. We create an imputation equation for married with the IMPUTE command. The procedure then fits a probit model that uses married and several other variables. The missing values are imputed using age, education and income in each of 25 iterations. The second set of results is the simple probit results using casewise deletion rather than imputation.
Multiple Imputation in Stata: Imputing

This is part four of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction. This section will talk you through the details of the imputation process. Be sure you've read at least the previous section, Creating Imputation Models, so you have a sense of what issues can affect the validity of your results.

Example Data

To illustrate the process, we'll use a fabricated data set. Unlike those in the examples section, this data set is designed to have some resemblance to real world data.

female (binary)
race (categorical, three values)
urban (binary)
edu (ordered categorical, four values)
exp (continuous)
wage (continuous)

Missingness: each value of all the variables except female has a 10% chance of being missing completely at random, but of course in the real world we won't know that it is MCAR ahead of time. Thus we will check whether it is MCAR or MAR (MNAR cannot be checked by looking at the observed data) using the procedure outlined in Deciding to Impute:

unab numvars: _all
unab missvars: urban-wage
misstable sum, gen(miss_)
foreach var of local missvars {
    local covars: list numvars - var
    display _newline(3) "logit missingness of `var' on `covars'"
    logit miss_`var' `covars'
    foreach nvar of local covars {
        display _newline(3) "ttest of `nvar' by missingness of `var'"
        ttest `nvar', by(miss_`var')
    }
}

See the log file for results. Our goal is to regress wages on sex, race, education level, and experience. To see the "right" answers, open the do file that creates the data set and examine the gen command that defines wage. Complete code for the imputation process can be found in the following do file. The imputation process creates a lot of output. We'll put highlights in this page; a complete log file including the associated graphs can be found here. Each section of this article will have links to the relevant section of the log.
Click "back" in your browser to return to this page.

Setting up

The first step in using mi commands is to mi set your data. This is somewhat similar to svyset, tsset, or xtset. The mi set command tells Stata how it should store the additional imputations you'll create. We suggest using the wide format, as it is slightly faster. On the other hand, mlong uses slightly less memory.

To have Stata use the wide data structure, type:

mi set wide

To have Stata use the mlong (marginal long) data structure, type:

mi set mlong

The wide vs. long terminology is borrowed from reshape and the structures are similar. However, they are not equivalent and you would never use reshape to change the data structure used by mi. Instead, type mi convert wide or mi convert mlong (add , clear if the data have not been saved since the last change). Most of the time you don't need to worry about how the imputations are stored: the mi commands automatically figure out how to apply whatever you do to each imputation. But if you need to manipulate the data in a way mi can't do for you, then you'll need to learn about the details of the structure you're using. You'll also need to be very, very careful. If you're interested in such things (including the rarely used flong and flongsep formats), run this do file and read the comments it contains while examining the data browser to see what the data look like in each form.

Registering Variables

The mi commands recognize three kinds of variables: Imputed variables are variables that mi is to impute or has imputed. Regular variables are variables that mi is not to impute, either by choice or because they are not missing any values. Passive variables are variables that are completely determined by other variables. For example, log wage is determined by wage, or an indicator for obesity might be determined by a function of weight and height. Interaction terms are also passive variables, though if you use Stata's interaction syntax you won't have to declare them as such.
Passive variables are often problematic: the examples on transformations, non-linearity, and interactions show how using them inappropriately can lead to biased estimates. If a passive variable is determined by regular variables, then it can be treated as a regular variable since no imputation is needed. Passive variables only have to be treated as such if they depend on imputed variables. Registering a variable tells Stata what kind of variable it is. Imputed variables must always be registered:

mi register imputed varlist

where varlist should be replaced by the actual list of variables to be imputed. Regular variables often don't have to be registered, but it's a good idea:

mi register regular varlist

Passive variables must be registered:

mi register passive varlist

However, passive variables are more often created after imputing. Do so with mi passive and they'll be registered as passive automatically. In our example data, all the variables except female need to be imputed. The appropriate mi register command is:

mi register imputed race-wage

(Note that you cannot use _all as your varlist even if you have to impute all your variables, because that would include the system variables added by mi set to keep track of the imputation structure.) Registering female as regular is optional, but a good idea:

mi register regular female

Checking the Imputation Model

Based on the types of the variables, the obvious imputation methods are:

race (categorical, three values): mlogit
urban (binary): logit
edu (ordered categorical, four values): ologit
exp (continuous): regress
wage (continuous): regress

female does not need to be imputed, but should be included in the imputation models, both because it is in the analysis model and because it's likely to be relevant. Before proceeding to impute we will check each of the imputation models.
Always run each of your imputation models individually, outside the mi impute chained context, to see if they converge and (insofar as it is possible) verify that they are specified correctly. Code to run each of these models is:

mlogit race i.urban exp wage i.edu i.female
logit urban i.race exp wage i.edu i.female
ologit edu i.urban i.race exp wage i.female
regress exp i.urban i.race wage i.edu i.female
regress wage i.urban i.race exp i.edu i.female

Note that when categorical variables (ordered or not) appear as covariates, i. expands them into sets of indicator variables. As we'll see later, the output of the mi impute chained command includes the commands for the individual models it runs. Thus a useful shortcut, especially if you have a lot of variables to impute, is to set up your mi impute chained command with the dryrun option to prevent it from doing any actual imputing, run it, and then copy the commands from the output into your do file for testing.

Convergence Problems

The first thing to note is that all of these models run successfully. Complex models like mlogit may fail to converge if you have large numbers of categorical variables, because that often leads to small cell sizes. To pin down the cause of the problem, remove most of the variables, make sure the model works with what's left, and then add variables back one at a time or in small groups until it stops working. With some experimentation you should be able to identify the problem variable or combination of variables. At that point you'll have to decide if you can combine categories, drop variables, or make other changes in order to create a workable model.

Perfect Prediction

Perfect prediction is another problem to note. The imputation process cannot simply drop the perfectly predicted observations the way logit can. You could drop them before imputing, but that seems to defeat the purpose of multiple imputation.
The alternative is to add the augment (or just aug) option to the affected methods. This tells mi impute chained to use the "augmented regression" approach, which adds fake observations with very low weights in such a way that they have a negligible effect on the results but prevent perfect prediction. For details see the section "The issue of perfect prediction during imputation of categorical data" in the Stata MI documentation.

Checking for Misspecification

You should also try to evaluate whether the models are specified correctly. A full discussion of how to determine whether a regression model is specified correctly or not is well beyond the scope of this article, but use whatever tools you find appropriate. Here are some examples:

Residual vs. Fitted Value Plots

For continuous variables, residual vs. fitted value plots (easily done with rvfplot) can be useful; several of the examples use them to detect problems. Consider the plot for experience:

regress exp i.urban i.race wage i.edu i.female
rvfplot

Note how a number of points are clustered along a line in the lower left, and no points are below it. This reflects the constraint that experience cannot be less than zero, which means that the residuals must always be greater than or equal to the negative of the fitted values. (If the graph had the same scale on both axes, the constraint line would be a 45 degree line.) If all the points were below a similar line rather than above it, this would tell you that there was an upper bound on the variable rather than a lower bound. The y-intercept of the constraint line tells you the limit in either case. You can also have both a lower bound and an upper bound, putting all the points in a band between them. The "obvious" model, regress, is inappropriate for experience because it won't apply this constraint.
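The geometry above follows directly from residual = y - fitted: if y >= 0, then residual >= -fitted, so every point in the plot lies on or above the line residual = -fitted. A minimal numeric check, using made-up outcomes and fitted values rather than the tutorial's wage data:

```python
# For a non-negative outcome y, the residual r = y - yhat satisfies
# r >= -yhat, so every point in a residual-vs-fitted plot sits on or
# above the line r = -yhat; the boundary is hit exactly when y == 0.
ys = [0.0, 0.0, 3.0, 7.5, 12.0]     # hypothetical non-negative outcomes
yhats = [1.2, 4.0, 2.5, 6.0, 10.0]  # hypothetical fitted values
residuals = [y - yhat for y, yhat in zip(ys, yhats)]
for r, yhat in zip(residuals, yhats):
    assert r >= -yhat               # the lower-bound constraint
print(residuals[:2])                # the y == 0 cases sit on the line
```

The two observations with y == 0 land exactly on the constraint line, which is why a floor in the data shows up as a sharp clustered edge in the plot rather than a gradual thinning of points.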
It's also inappropriate for wages for the same reason. Alternatives include truncreg, ll(0) and pmm (we'll use pmm).

Adding Interactions

In this example, it seems plausible that the relationships between variables may vary between race, gender, and urban/rural groups. Thus one way to check for misspecification is to add interaction terms to the models and see whether they turn out to be important. For example, we'll compare the obvious model:

regress exp i.race wage i.edu i.urban i.female

with one that includes interactions:

regress exp (i.race i.urban i.female)##(c.wage i.edu)

We'll run similar comparisons for the models of the other variables. This creates a great deal of output, so see the log file for results. Interactions between female and other variables are significant in the models for exp, wage, edu, and urban. There are a few significant interactions between race or urban and other variables, but not nearly as many (and keep in mind that with this many coefficients we'd expect some false positives using a significance level of .05). We'll thus impute the men and women separately. This is an especially good option for this data set because female is never missing. If it were, we'd have to drop the observations which are missing female, because they could not be placed in one group or the other. In the imputation command this means adding the by(female) option. When testing models, it means starting the commands with the by female: prefix (and removing female from the lists of covariates). The improved imputation models are thus:

bysort female: reg exp i.urban i.race wage i.edu
by female: logit urban exp i.race wage i.edu
by female: mlogit race exp i.urban wage i.edu
by female: reg wage exp i.urban i.race i.edu
by female: ologit edu exp i.urban i.race wage

pmm itself cannot be run outside the imputation context, but since it's based on regression you can use regular regression to test it.
These models should be tested again, but we'll omit that process.

The basic syntax for mi impute chained is:

mi impute chained (method1) varlist1 (method2) varlist2 ... = regvars

Each method specifies the method to be used for imputing the following varlist. The possibilities for method are regress, pmm, truncreg, intreg, logit, ologit, mlogit, poisson, and nbreg. regvars is a list of regular variables to be used as covariates in the imputation models but not imputed (there may not be any). The basic options are:

add(N) rseed(R) savetrace(tracefile, replace)

N is the number of imputations to be added to the data set. R is the seed to be used for the random number generator; if you do not set this you'll get slightly different imputations each time the command is run. The tracefile is a dataset in which mi impute chained will store information about the imputation process. We'll use this dataset to check for convergence. Options that are relevant to a particular method go with the method, inside the parentheses but following a comma (e.g. (mlogit, aug)). Options that are relevant to the imputation process as a whole (like by(female)) go at the end, after the comma. For our example, the command would be:

mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage, add(5) rseed(4409) by(female)

Note that this does not include a savetrace() option. As of this writing, by() and savetrace() cannot be used at the same time, presumably because it would require one trace file for each by group. Stata is aware of this problem and we hope it will be fixed soon. For purposes of this article, we'll remove the by() option when it comes time to illustrate use of the trace file. If this problem comes up in your research, talk to us about work-arounds.

Choosing the Number of Imputations

There is some disagreement among authorities about how many imputations are sufficient.
Some say 3-10 in almost all circumstances, the Stata documentation suggests at least 20, while White, Royston, and Wood argue that the number of imputations should be roughly equal to the percentage of cases with missing values. However, we are not aware of any argument that increasing the number of imputations ever causes problems (just that the marginal benefit of another imputation asymptotically approaches zero). Increasing the number of imputations in your analysis takes essentially no work on your part: just change the number in the add() option to something bigger. On the other hand, it can be a lot of work for the computer; multiple imputation has introduced many researchers to the world of jobs that take hours or days to run. You can generally assume that the amount of time required will be proportional to the number of imputations used (e.g. if a do file takes two hours to run with five imputations, it will probably take about four hours to run with ten imputations). So here's our suggestion: Start with five imputations (the low end of what's broadly considered legitimate). Work on your research project until you're reasonably confident you have the analysis in its final form. Be sure to do everything with do files so you can run it again at will. Note how long the process takes, from imputation to final analysis. Consider how much time you have available and decide how many imputations you can afford to run, using the rule of thumb that time required is proportional to the number of imputations. If possible, make the number of imputations roughly equal to the percentage of cases with missing data (a high-end estimate of what's required). Allow time to recover if things go wrong, as they generally do. Then increase the number of imputations in your do file and start it. Do something else while the do file runs, like write your paper.
Adding imputations shouldn't change your results significantly, and in the unlikely event that they do, consider yourself lucky to have found that out before publishing.

Speeding up the Imputation Process

Multiple imputation has introduced many researchers to the world of jobs that take hours, days, or even weeks to run. Usually it's not worth spending your time to make Stata code run faster, but multiple imputation can be an exception. Use the fastest computer available to you. For SSCC members that means learning to run jobs on Linstat, the SSCC's Linux computing cluster. Linux is not as difficult as you may think: Using Linstat has instructions. Multiple imputation involves more reading and writing to disk than most Stata commands. Sometimes this includes writing temporary files in the current working directory. Use the fastest disk space available to you, both for your data set and for the working directory. In general local disk space will be faster than network disk space, and on Linstat ramdisk (a "directory" that is actually stored in RAM) will be faster than local disk space. On the other hand, you would not want to permanently store data sets anywhere but network disk space. So consider having your do file do something like the following: Windows (Winstat or your own PC). This applies when you're using imputed data as well. If your data set is large enough that working with it after imputation is slow, the above procedure may help.

Checking for Convergence

MICE is an iterative process. In each iteration, mi impute chained first estimates the imputation model, using both the observed data and the imputed data from the previous iteration. It then draws new imputed values from the resulting distributions. Note that as a result, each iteration has some autocorrelation with the previous iteration.
The first iteration must be a special case: in it, mi impute chained first estimates the imputation model for the variable with the fewest missing values based only on the observed data and draws imputed values for that variable. It then estimates the model for the variable with the next fewest missing values, using both the observed values and the imputed values of the first variable, and proceeds similarly for the rest of the variables. Thus the first iteration is often atypical, and because iterations are correlated it can make subsequent iterations atypical as well. To avoid this, mi impute chained by default goes through ten iterations for each imputed data set you request, saving only the results of the tenth iteration. The first nine iterations are called the burn-in period. Normally this is plenty of time for the effects of the first iteration to become insignificant and for the process to converge to a stationary state. However, you should check for convergence, and if necessary increase the number of iterations with the burnin() option to ensure it. To do so, examine the trace file saved by mi impute chained. It contains the mean and standard deviation of each imputed variable in each iteration. These will vary randomly, but they should not show any trend. An easy way to check is with tsline, but it requires reshaping the data first. Our preferred imputation model uses by(), so it cannot save a trace file. Thus we'll remove by() for the moment. We'll also increase the burnin() option to 100 so it's easier to see what a stable trace looks like.
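The "no trend" check described above can be reduced to a simple screen: fit a least-squares slope to the per-iteration means and confirm it is near zero. This is a minimal sketch with a made-up trace, not output from mi impute chained:

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, used here to screen a trace of
    per-iteration imputed-variable means for trend."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# hypothetical trace: mean imputed experience at iterations 1..10
trace = [12.1, 11.8, 12.3, 12.0, 11.9, 12.2, 12.0, 12.1, 11.9, 12.2]
slope = ols_slope(list(range(1, 11)), trace)
# a slope near zero is consistent with a stationary (converged) chain;
# a clearly nonzero slope suggests the burn-in period was too short
```

A visual check like tsline remains more informative than a single slope (it can reveal cycles or drift that average out), so treat this as a quick supplement, not a replacement.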
We'll then use reshape and tsline to check for convergence:

    preserve
    mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp = wage female, add(5) rseed(88) savetrace(extrace, replace) burnin(100)
    use extrace, replace
    reshape wide *mean *sd, i(iter) j(m)
    tsset iter
    tsline exp_mean*, title("Mean of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
    graph export conv1.png, replace
    tsline exp_sd*, title("Standard Deviation of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
    graph export conv2.png, replace
    restore

The resulting graphs do not show any obvious problems. If you do see signs that the process may not have converged after the default ten iterations, increase the number of iterations performed before saving imputed values with the burnin() option. If convergence is never achieved, that indicates a problem with the imputation model.

Checking the Imputed Values

After imputing, you should check to see whether the imputed data resemble the observed data. Unfortunately there's no formal test to determine what's "close enough." Of course if the data are MAR but not MCAR, the imputed data should be systematically different from the observed data. Ironically, the fewer missing values you have to impute, the more variation you'll see between the imputed data and the observed data (and between imputations).

For binary and categorical variables, compare frequency tables. For continuous variables, comparing means and standard deviations is a good starting point, but you should look at the overall shape of the distribution as well. For that we suggest kernel density graphs or perhaps histograms. Look at each imputation separately rather than pooling all the imputed values, so you can see if any one of them went wrong.

The mi xeq: prefix tells Stata to apply the subsequent command to each imputation individually. It also applies to the original data, the "zeroth imputation." Thus:

    mi xeq: tab race

will give you six frequency tables: one for the original data, and one for each of the five imputations. However, we want to compare the observed data to just the imputed data, not the entire data set. This requires adding an if condition to the tab commands for the imputations, but not the observed data. Add a number or numlist to have mi xeq act on particular imputations:

    mi xeq 0: tab race
    mi xeq 1/5: tab race if miss_race

This creates frequency tables for the observed values of race and then the imputed values in all five imputations. If you have a significant number of variables to examine, you can easily loop over them:

    foreach var of varlist urban race edu {
        mi xeq 0: tab `var'
        mi xeq 1/5: tab `var' if miss_`var'
    }

For results see the log file. Running summary statistics on continuous variables follows the same process, but creating kernel density graphs adds a complication: you need to either save the graphs or give yourself a chance to look at them. mi xeq: can carry out multiple commands for each imputation: just place them all in one line with a semicolon (;) at the end of each. (This will not work if you've changed the general end-of-command delimiter to a semicolon.) The sleep command tells Stata to pause for a specified period, measured in milliseconds.

    mi xeq 0: kdensity wage; sleep 1000
    mi xeq 1/5: kdensity wage if miss_wage; sleep 1000

Again, this can all be automated:

    foreach var of varlist wage exp {
        mi xeq 0: sum `var'
        mi xeq 1/5: sum `var' if miss_`var'
        mi xeq 0: kdensity `var'; sleep 1000
        mi xeq 1/5: kdensity `var' if miss_`var'; sleep 1000
    }

Saving the graphs turns out to be a bit trickier, because you need to give the graph from each imputation a different file name. Unfortunately you cannot access the imputation number within mi xeq. However, you can do a forvalues loop over imputation numbers, then have mi xeq act on each of them:

    forval i=1/5 {
        mi xeq `i': kdensity exp if miss_exp; graph export exp`i'.png, replace
    }

Integrating this with the previous version gives:

    foreach var of varlist wage exp {
        mi xeq 0: sum `var'
        mi xeq 1/5: sum `var' if miss_`var'
        mi xeq 0: kdensity `var'; graph export chk`var'0.png, replace
        forval i=1/5 {
            mi xeq `i': kdensity `var' if miss_`var'; graph export chk`var'`i'.png, replace
        }
    }

For results, see the log file.

It's troubling that in all imputations the mean of the imputed values of wage is higher than the mean of the observed values of wage, and the mean of the imputed values of exp is lower than the mean of the observed values of exp. We did not find evidence that the data are MAR but not MCAR, so we'd expect the means of the imputed data to be clustered around the means of the observed data. There is no formal test to tell us definitively whether this is a problem or not. However, it should raise suspicions, and if the final results with these imputed data differ from the results of complete case analysis, it raises the question of whether the difference is due to problems with the imputation model.

Last Revised: 8/23/2012
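The observed-versus-imputed comparison described above can also be summarized numerically. The following Python sketch (all numbers invented for illustration) compares the observed mean of a variable with the mean of the imputed values in each imputation; when every imputation's mean falls on the same side of the observed mean, that one-sided pattern is the kind worth investigating.

```python
import statistics

# Invented example: observed wages and the imputed wages from each of
# five imputations (all values are made up for illustration).
observed_wage = [8.5, 12.0, 9.75, 15.0, 11.2, 10.0, 13.4]
imputed_wage = {
    1: [13.0, 14.2, 12.8],
    2: [12.5, 13.9, 14.1],
    3: [13.3, 12.9, 13.6],
    4: [14.0, 13.1, 12.7],
    5: [13.5, 14.4, 13.0],
}

obs_mean = statistics.mean(observed_wage)
print(f"observed: mean={obs_mean:.2f}")
for m, values in imputed_wage.items():
    print(f"imputation {m}: mean={statistics.mean(values):.2f}")

# Here every imputation's mean is above the observed mean -- the same
# one-sided pattern flagged for wage in the text above.
```

This does not replace looking at the full distributions, but it makes the systematic offset easy to spot across imputations.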
