Multiple Imputation After 18+ Years

Multiple imputation was designed to handle the problem of missing data in public-use data bases where the data-base constructor and the ultimate user are distinct entities. The objective is valid frequency inference for ultimate users who in general have access only to complete-data software and possess limited knowledge of specific reasons and models for nonresponse. For this situation and objective, I believe that multiple imputation by the data-base constructor is the method of choice. This article first provides a description of the assumed context and objectives, and second, reviews the multiple imputation framework and its standard results. These preliminary discussions are especially important because some recent commentaries on multiple imputation have reflected either misunderstandings of the practical objectives of multiple imputation or misunderstandings of fundamental theoretical results. Then, criticisms of multiple imputation are considered, and, finally, comparisons are made to alternative strategies.

Missing values are a problem in many data sets and seem especially common in the medical and social sciences. For nearly two decades I have been advocating and developing the use of multiple imputation to address aspects of this problem; early documents include Rubin (1977a, 1977b, 1978, 1980, 1983), Herzog and Rubin (1983), Rubin and Schenker (1986), and the basic reference Rubin (1987). There are situations where multiple imputation is appropriate, and, as with any statistical tool, there are others where its application is more questionable. Originally it was viewed as being most appropriate in complex surveys that are used to create public-use data sets to be shared by many ultimate users, although over the years, it has proven valuable in other settings as well.
For the context for which it was envisioned, with database constructors and ultimate users as distinct entities, I firmly believe that multiple imputation is the method of choice for addressing problems due to missing values: alternative methods either require special knowledge and techniques not available to typical users or produce answers that are generally not statistically valid for scientific estimands. This is a strong statement, and it is clear that its accuracy must depend on the class of problems to which it is applied. Consequently, this article begins with a description of the assumed statistical computing environment for the ultimate users of shared data-bases and of our objectives for handling missing data in this environment. It is especially important to provide this background to emphasize that the goal of multiple imputation is to provide statistically valid inferences [...].
In Section 2 multiple imputation is reviewed, with particular emphasis given to how it was designed to satisfy the stated objectives in the assumed environment for ultimate users. This review of critical points of the theory and intended practice of multiple imputation minimizes technical details so that essential statistical points will be more transparent than in the theoretical material in Rubin (1987), which requires substantial familiarity with, and acceptance of the relevance of, both randomization-based and Bayesian inference. Then, in Section 3, current concerns about multiple imputation are discussed with the benefit of the simplified theory. Finally, competing techniques are evaluated for their utility in the assumed context and are found to be less effective than multiple imputation.

Assumed Environment for Ultimate Users
Public-use (shared) data bases are analyzed by many ultimate users with varying degrees of statistical expertise and computing power, and with different scientific questions and objectives. Typically such users have available to them a number of standard complete-data techniques. These include various stand-alone routines such as ones for ordinary least-squares regression, logistic regression, factor analysis, variance components estimation, proportional hazards models, etc., and various packages of programs such as SAS, BMDP, SPSS, etc. Also, there may be available routines for inference in the presence of missing data under particular models (e.g., Schafer 1995), complete-data management routines for merging files, subsetting data, deleting cases and variables, applying transformations, and creating new variables, or various resampling programs to create simulated replicate data, principally jackknife and bootstrap routines.

© 1996 American Statistical Association
Essentially all public-use data sets have missing values, typically not of any nice neat type. In general, ultimate users have neither the knowledge nor the tools to address missing data problems satisfactorily. Even if some ultimate users do have adequate resources for modeling and computation, data-base constructors typically know more about reasons for nonresponse and have better access to confidential and detailed information not released for public use (e.g., exact addresses and neighborhood relationships, hourly blood pressure readings and doctor indicators), information that can be useful for modeling missing data. Moreover, ultimate users should be focused on their substantive scientific analyses and for these, missing data are generally simply a nuisance. My conclusion is that "correctly" modeling the missing data must be, in general, the data constructor's responsibility.
We, that is, data-base constructors and statistical software designers, have no direct control over what ultimate users will do with their arsenal of tools. We cannot stop users from doing bad science, but if possible we should facilitate their ability to do good science with their available tools, even when data sets suffer from missing values.

Achievable Basic Objective
One achievable basic objective in such a setting is the following: Each tool in the ultimate users' existing arsenals can be applied to any data set with missing values using the same command structure and output standards as if there were no missing data. The only additional software that is allowed to be required comprises entirely general macros that can be applied to any complete-data analysis and incomplete data set. Certain ad hoc methods of handling missing data, such as "complete-case analysis," "available-case analysis," and "fill-in with means" (e.g., see Little and Rubin 1987, part I), satisfy this basic objective and so have a certain appeal. The problem with such methods is that they typically yield statistically invalid answers for scientific estimands; "scientific estimands" and "statistically valid" require definition.

Scientific Estimands
By a scientific estimand I mean a quantity of scientific interest that can be calculated in the population and does not change its value depending on the data collection design used to measure it (i.e., it does not vary with sample size and survey design, or the number of nonrespondents, or follow-up efforts). Letting $X$ be the array of all background (e.g., stratification) information fully observed in a population and $Y$ be the array of outcome information in the population that is to be sampled in the survey, a scientific estimand is a function of $X$ and $Y$, say $Q = Q(X, Y)$. Scientific estimands include population means, variances, correlations, factor loadings, regression coefficients, and these quantities within strata or domains, but exclude the sampling variance of a sample mean under a particular sampling plan and the expectation of the complete-data sample mean when missing values are filled in with zero or the observed sample mean. These latter quantities can be important for inference and design, but they are not scientific estimands in my definition because they are functions of sample size, sample design, response rates for a particular survey, methods for handling missing values, and scientific estimands such as population means and variances.
The distinction between estimates of scientific estimands and measures of their uncertainty is an old one in statistics; see, for example, Fisher (1925, p. 724) where a measure of uncertainty associated with an estimate is called an "ancillary" statistic, that is, a subordinate or supplemental statistic. In Fisher's context, the estimate was the maximum likelihood estimate and the ancillary statistic was the second derivative of the log-likelihood, but the distinction is relevant to more general estimates and associated measures of uncertainty, as we see in the next section.

What is Meant by Statistically Valid?
In the context of shared data bases supporting analyses by many users, I believe that statistically valid must be a frequency concept, averaging over randomization distributions generated by known sampling mechanisms (used to collect data) and posited distributions for the response mechanisms (the processes underlying nonresponse). In standard scientific surveys, the sampling mechanism is known but the nonresponse mechanism is rarely fully known and so typically must be posited, either implicitly or explicitly.
Bayesian validity is also important, but is far more difficult to achieve in this context because it requires far more compatibility between the data-base constructor and the analyst. In fact, in general I do not believe it can be achieved in any real sense in the context of the basic objective to use existing complete-data tools with shared data bases. In any case, no Bayesian should object to achieving frequentist validity; effectively, Bayesians want and promise much more: calibration conditional on the data in addition to unconditional calibration (e.g., in Rubin 1984, I call such frequency calculations "Bayesianly relevant and justifiable").
First and foremost, for statistical validity for scientific estimands, point estimation must be approximately unbiased for the scientific estimands, averaging over the sampling and the posited nonresponse mechanisms (e.g., filling in zeros or means is not generally acceptable). Second, interval estimation and hypothesis testing must be valid in the sense that nominal levels describe operating characteristics over sampling and posited response mechanisms. Two versions of such frequentist validity for nominal levels are especially important to distinguish when assessing multiple imputation.
Using terminology from Rubin (1987, pp. 117-118), [...] A more generally achievable objective, however, is "confidence validity," meaning that for interval estimates, "actual interval coverage $\geq$ nominal interval coverage," and for tests of hypotheses, "actual rejection rate $\leq$ nominal rejection rate." For confidence validity with complete data, we replace (1.2) with
$$E(U \mid X, Y) \geq \mathrm{var}(\hat{Q} \mid X, Y). \tag{1.3}$$
If (1.3) is true but (1.2) is not, then $U^{-1}$ is only an "approximate weight for the value of the estimate" (Fisher 1925, p. 724).
The distinction between randomization validity and confidence validity can be quite important when dealing with approximate procedures, which necessarily arise with nonresponse in public-use surveys, and this distinction appears in Neyman (1934), which is the foundation for statisticians' current view of frequentist validity in surveys. Here Neyman (1934, pp. 562-563) defined confidence intervals, confidence coefficients, and confidence limits, and these definitions remain the accepted mathematical definitions of these terms (e.g., Lehmann 1959). In particular, confidence limits are statistics defining an interval such that, in repeated experience, the estimand lies in the confidence interval with probability greater than or equal to the confidence coefficient; the shorter the interval satisfying this constraint, the better.
A simple example illustrates the wisdom implicit in Neyman's definition. Consider a particular situation with two different confidence-valid procedures for creating confidence intervals with confidence coefficient 95%. Procedure 1 produces intervals that are always shorter than the intervals produced by Procedure 2, and moreover, Procedure 1 has actual probability > 95% of covering the estimand, whereas Procedure 2 has only the nominal 95% probability of covering the estimand. Clearly, Procedure 1 is scientifically and statistically superior to Procedure 2 because it provides tighter inferences with greater confidence, and Neyman's definition and desiderata agree with this fact. Requiring exact agreement between nominal and actual levels as a desideratum for validity would lead one to reject Procedure 1 as invalid and choose Procedure 2, clearly a mistake. It is for this reason that confidence validity is more fundamental than randomization validity for interval estimation.
Of course, if we have a procedure that is confidence valid but not randomization valid, there is the hope that a better confidence-valid procedure exists (i.e., one with shorter intervals), which is also randomization valid, but in general this is not achievable. An attendant advantage, when the best confidence interval is randomization valid, is that the associated measure of precision can be thought of as a true rather than approximate weight (again, in the sense of Fisher 1925, p. 724; also see Fisher 1934, criticizing Neyman 1934, on this point).

Supplemental Objective Concerning Statistical Validity
We are now prepared to supplement the Achievable Basic Objective when faced with missing values, regarding the ability to apply standard complete-data statistical tools, with an objective concerning statistically valid inference for scientific estimands. It is easy to ask for more than is possible and then do something misguided when attempting the impossible. We first consider a hopeless objective, which is commonly sought, and then state an achievable one.
Hopeless Supplemental Objective. Each complete-data statistical tool can be applied to each incomplete data set to obtain the same inference as if the data set had no missing values.
This objective is clearly impossible because of the lost information, but nevertheless, it guides some thinking about how to handle missing data. It is analogous to saying that the objective of a survey is to obtain the same answer as a complete census, and it can lead to an "operations research" objective of creating imputations for missing values that are as close as possible to the truth (i.e., fill in missing values to minimize some objective function). Our actual objective is valid statistical inference, not optimal point prediction under some loss function, and replacing the former with the latter can lead one badly astray. For example, suppose we have a coin that, in truth, is biased .6 heads and .4 tails. This known truth is model A, whereas model B asserts that the coin has two heads. Using model A for creating imputations (i.e., future predictions) yields a hit rate (agreements between predictions and outcomes) of $.6 \times .6 + .4 \times .4 = .52$, whereas using model B for predictions yields a hit rate of .6. This does not mean that model B is better than model A for handling missing values. Filling in missing values using model B yields the invalid statistical inference that in the future all coin tosses will be heads, clearly inconsistent for the estimand $Q$ = fraction of tosses that are heads, whereas using model A yields consistent estimates for all such scientific estimands. The lesson is simple: Judging the quality of missing data procedures by their ability to recreate the individual missing values (according to hit-rate, mean square error, etc.) does not lead to choosing procedures that result in valid inference, which is our objective.
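The arithmetic of the coin example can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the 30% fraction of missing tosses is an assumed value chosen for the illustration, not a figure from the text.

```python
def hit_rate(p_true_heads: float, p_pred_heads: float) -> float:
    """Probability that an independent prediction agrees with the outcome."""
    return (p_true_heads * p_pred_heads
            + (1 - p_true_heads) * (1 - p_pred_heads))


def long_run_estimate(p_true_heads: float, p_imputed_heads: float,
                      frac_missing: float) -> float:
    """Expected fraction of heads in a data set completed by imputation."""
    return ((1 - frac_missing) * p_true_heads
            + frac_missing * p_imputed_heads)


P_TRUE = 0.6        # true probability of heads, as in the text
FRAC_MISSING = 0.3  # assumed fraction of missing tosses (hypothetical)

# Model A imputes heads with the true probability; model B always imputes heads.
hit_a = hit_rate(P_TRUE, 0.6)
hit_b = hit_rate(P_TRUE, 1.0)

# Estimates of Q = fraction of heads after filling in the missing tosses.
est_a = long_run_estimate(P_TRUE, 0.6, FRAC_MISSING)
est_b = long_run_estimate(P_TRUE, 1.0, FRAC_MISSING)

print(f"hit rate: A={hit_a:.2f}, B={hit_b:.2f}")
print(f"estimate of Q: A={est_a:.2f}, B={est_b:.2f} (truth {P_TRUE:.2f})")
```

Model B wins on hit rate, yet the completed-data estimate of Q under model B is pushed above .6, whereas under model A it stays at the true value.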
Statistical validity in our context is difficult because the answer that results from applying a complete-data analysis to an incomplete data set is generally invalid unless the complete-data analysis in the absence of missing data is valid (the ultimate user's responsibility) and the reasons for missing data are correctly modelled (the data-base constructor's responsibility). We can essentially never be sure that the data-base constructor's model is appropriate, but assuming it is, and assuming that the ultimate user is performing an analysis that would be valid if there were no missing data, we can expect that the ultimate user will obtain a valid inference.
Achievable Supplemental Objective. Assuming that the ultimate user's complete-data analysis is statistically valid for a scientific estimand, the answer that results from applying the same analysis method to an incomplete-data set remains statistically valid for the same scientific estimand assuming the truth of the data-base constructor's posited model for missing data.
I doubt that there is a much stronger objective regarding validity that we can achieve in this context where the ultimate user and the data-base constructor are distinct entities. Multiple imputation was designed to satisfy both achievable objectives by using the Bayesian and frequentist paradigms in complementary ways: the Bayesian model-based approach to create procedures, and the frequentist (randomization-based) approach to evaluate procedures.

REVIEW OF MULTIPLE IMPUTATION FRAMEWORK AND RESULTS
Multiple imputations for the set of missing values are multiple sets of plausible values; these can reflect uncertainty under one model for nonresponse and across several models. Each set of imputations is used to create a completed data set, each of which is to be analyzed using standard complete-data software to yield "completed-data" statistics, which are typically complete-data estimates, $\hat{Q}$, associated variance-covariance matrices, $U$, and $p$ values. The complete-data statistics $\hat{Q}$ and $U$ are general; for example, $U$ may be obtained by mathematical analysis, linearization methods, balanced-repeated replication, the jackknife (see, e.g., Krewski and Rao 1981), the bootstrap (see, e.g., Efron 1994), or special routines for complex surveys such as SUDAAN or VPLX (see, e.g., Fay 1990). But no matter how $\hat{Q}$ and $U$ are calculated with complete data, once missing data are filled in by imputation, they can be calculated as if the data set were complete.
The key Bayesian motivation for multiple imputation is given by result 3.1 in Rubin (1987). Ignoring both technical details and indicator variables for sampling and response, the result and its consequences can be easily stated using the simplified notation that the complete data are $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ is observed and $Y_{mis}$ is missing. Specifically, the basic result is
$$P(Q \mid Y_{obs}) = \int P(Q \mid Y_{obs}, Y_{mis}) \, P(Y_{mis} \mid Y_{obs}) \, dY_{mis}.$$
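How the completed-data statistics from the m analyses are combined can be sketched in code. The example below handles a scalar estimand only and assumes the standard repeated-imputation combining rules of Rubin (1987): the average of the m estimates, the average within-imputation variance, the between-imputation variance, a total variance carrying the (1 + 1/m) correction, and the associated reference-t degrees of freedom. The function name `pool_scalar` and the example numbers are hypothetical.

```python
import numpy as np


def pool_scalar(estimates, variances):
    """Combine m completed-data estimates and variances for a scalar estimand.

    Returns (q_bar, u_bar, b, t, df):
      q_bar : average of the m completed-data estimates
      u_bar : average within-imputation variance
      b     : between-imputation variance of the estimates
      t     : total variance, u_bar + (1 + 1/m) * b
      df    : degrees of freedom of the reference t distribution
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need m >= 2 estimates and matching variances")
    q_bar = q.mean()
    u_bar = u.mean()
    b = q.var(ddof=1)
    t = u_bar + (1 + 1 / m) * b
    # With no between-imputation variability the reference t is normal.
    df = np.inf if b == 0 else (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, u_bar, b, t, df


# Hypothetical results from m = 3 completed-data analyses.
q_bar, u_bar, b, t, df = pool_scalar([10.0, 11.0, 12.0], [4.0, 4.0, 4.0])
print(q_bar, u_bar, b, t, df)
```

Because the function needs only the m pairs of estimates and variances, it is indifferent to how the ultimate user's complete-data software computed them.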

Evaluating Repeated-imputation Procedures Under the Randomization-Based Paradigm
The Bayesian paradigm, which is used to derive repeated-imputation inferences, is formally predicated on the correctness of all the model specifications. Although this paradigm is ideal for creating procedures, especially in complicated situations, its results cannot be unequivocally endorsed for routine practice because, in practice, we can never be sure any model assumptions are correct. Consequently, the Bayesianly-derived repeated-imputation procedures were evaluated in chapter 4 in Rubin (1987).

Conclusion Regarding Randomization Validity With Proper Multiple Imputation
The crucial result regarding the randomization validity of the large-$m$ repeated-imputation inference, given by (2.5), averages over both the actual sampling mechanism and the posited response mechanism; it is simple and holds no matter how complex the survey design: [...] mechanism and an appropriate model for the data, then in large samples the imputation method is proper.... There is little doubt that if this conclusion were formalized in a particular way, exceptions to it could be found. Its usefulness is not as a general mathematical result, but rather as a guide to practice. Nevertheless, in order to understand why it may be expected to hold relatively generally, it is important to provide a general heuristic argument for it (Rubin 1987, pp. 125-126).
This heuristic argument treated the sample as the population with estimand $\hat{Q}$ (and $U$), where the resulting posterior distribution of $\hat{Q}$ was centered at $\bar{Q}_\infty$ with variance $B_\infty$; assuming the Bayesian model appropriate [in the sense of satisfying (2.6) and (2.7)] and the samples large, standard arguments presented in chapter 2 of Rubin (1987) suggested that typically $(\hat{Q} - \bar{Q}_\infty) B_\infty^{-1/2}$ will have a sampling distribution (over the response mechanism) that is standard normal, thereby satisfying the basic conditions for proper multiple imputation.

Include All Variables in a Multiple Imputation Model To Make It Proper in General
The definition of proper concerns the situation where: "population" = complete-data sample, "estimands" = complete-data statistics $(\hat{Q}, U)$, "survey design" = the posited response mechanism, the criterion is valid frequency inference, and the method for creating inferences is Bayesian predictive inference using simulated values (i.e., multiple imputations). As with any finite population survey where valid frequency inference is desired from predictive procedures: (1) variables involved in the definition of estimands (i.e., $\hat{Q}$, $U$) should be predicted, and (2) variables involved in the survey design (i.e., the response mechanism) should be used as predictors. More explicitly, when $\hat{Q}$ or $U$ involves some variable $X$, then leaving $X$ out of the imputation scheme is improper and generally leads to biased estimation and invalid survey inference. For example, if $X$ is correlated with $Y$ but not used to multiply-impute $Y$, then the multiply-imputed data set will yield estimates of the $XY$ correlation biased towards zero. In a complex survey, $\hat{Q}$, and especially $U$, depend on stratification and clustering indicators; consequently, in general these indicators need to be included as predictor variables in imputation models for the multiple imputation scheme to be proper. Minimally, major clustering and stratification indicators and sample design weights (or estimated propensity scores of being in the sample) should be included in imputation models. Ezzati-Rice, Johnson, Khare, Little, Rubin, and Schafer (1995) illustrates such efforts and the resulting valid inferences.
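The attenuation of the XY correlation when X is left out of the imputation model can be illustrated with a small simulation. The sketch below makes several assumptions of its own: bivariate normal data with true correlation 0.6, 40% of Y missing completely at random, m = 5, and imputations drawn from normal models with noninformative priors. The function names and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, m = 5000, 0.6, 5           # hypothetical sample size, correlation, imputations

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
missing = rng.random(n) < 0.4      # Y missing completely at random
obs = ~missing
n_obs, n_mis = int(obs.sum()), int(missing.sum())


def impute_ignoring_x():
    """Draw Y_mis from a normal model fitted to the observed Y only."""
    y_o = y[obs]
    sigma2 = (n_obs - 1) * y_o.var(ddof=1) / rng.chisquare(n_obs - 1)
    mu = rng.normal(y_o.mean(), np.sqrt(sigma2 / n_obs))
    return rng.normal(mu, np.sqrt(sigma2), size=n_mis)


def impute_using_x():
    """Draw Y_mis from a normal linear regression of Y on X."""
    design = np.column_stack([np.ones(n_obs), x[obs]])
    beta_hat, *_ = np.linalg.lstsq(design, y[obs], rcond=None)
    resid = y[obs] - design @ beta_hat
    sigma2 = (resid @ resid) / rng.chisquare(n_obs - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    beta = rng.multivariate_normal(beta_hat, cov)
    return rng.normal(beta[0] + beta[1] * x[missing], np.sqrt(sigma2))


def mean_completed_correlation(imputer):
    """Average the completed-data XY correlation over m imputations."""
    corrs = []
    for _ in range(m):
        y_completed = y.copy()
        y_completed[missing] = imputer()
        corrs.append(np.corrcoef(x, y_completed)[0, 1])
    return float(np.mean(corrs))


r_without_x = mean_completed_correlation(impute_ignoring_x)
r_with_x = mean_completed_correlation(impute_using_x)
print(f"true rho = {rho}, X omitted: {r_without_x:.3f}, X included: {r_with_x:.3f}")
```

In this setup the imputed Y values are unrelated to X when X is omitted, so only the observed cases contribute covariance and the completed-data correlation shrinks toward zero.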
Since with public-use data sets it is always unclear what analyses the ultimate users will conduct, the range of statistics $(\hat{Q}, U)$ that might be used involves essentially any variable or combination of variables available in the data set, at least up to some level of interactions. Thus, the danger with an imputer's model is generally in leaving out predictors rather than including too many, and the advice has always been to include as many variables as possible when doing multiple imputation. The press to include all possibly relevant predictors is demanding in practice, but it is generally a worthy goal. For example, in the original prescription for the industry and occupation recoding project (Rubin 1983), thousands of logistic regressions were done, each with nearly 20 variables, and some with far fewer than 20 observations (e.g., 4), in order to preserve this theme of trying to include all variables that might be used to define statistics $\hat{Q}$ or $U$; this effort required the development of specialized but computationally efficient Bayesian logistic regression procedures for sparse data (Clogg, Rubin, Schenker, Schultz, and Weidman 1991). The possible lost precision when including unimportant predictors is usually viewed as a relatively small price to pay for the general validity of analyses of the resultant multiply-imputed data base.

Some Experience With Useful But Improper Multiple Imputation
In some cases, improper multiple imputations can still lead to confidence-valid repeated-imputation inferences. This issue will be discussed in more detail in Sections 3.5-3.8 in reply to a recent criticism of multiple imputation, but the issue has been previously considered. Rubin and Schenker (1987) [...]
Substantial empirical work, some given in the Appendix, supports the conclusion that, even if mildly important predictors are left out of the multiple imputation scheme, the repeated-imputation inferences are confidence-valid: with fractions of missing information typical in careful surveys, $m = 3$ or 5 works very well, with the complete-data procedure for small $n$ typically breaking down before multiple imputation does. A heuristic reason for this robustness is that lack of model fit goes into residual variance, which in a Bayesian model inflates the between-imputation variance of draws (e.g., of regression coefficients), thereby leading to a large enough $B_m$ to compensate for an omitted coefficient. Of course, this is an observation based on some experience, not a theorem, but a related theoretical result (Meng 1994, lem. 2) lends support to this observation.
Nevertheless, because problems can occur when the imputer's model leaves out important predictor variables, the data-base constructor must include a description of the imputation model with the multiply-imputed data base, so that ultimate users know which relationships among variables have been implicitly set to zero.

CURRENT ISSUES CONCERNING MULTIPLE IMPUTATION
There appear to be two distinct kinds of concerns about multiple imputation. The first type focuses on its implementation: operational difficulties for the data-base constructor and the ultimate user, as well as the acceptability of answers obtained partially through the use of simulation. The second type concerns the frequentist validity of repeated-imputation inferences when the multiple imputations are not proper, but appear "reasonable" in some sense.

Is Multiple Imputation Unprincipled or Unacceptable Because it Uses Simulation?
An early criticism, not much heard anymore but worthy of response, is that multiple imputation is theoretically unsatisfactory and practically unacceptable because it adds random noise to the data. In this context, it is critical to remember that multiple imputation does not pretend to create information through simulated values but simply to represent the observed information this way so as to make it amenable to valid analysis using complete-data tools. The extra noise created when using a finite number of imputations is the price to be paid for this luxury.
In response to this criticism, first appreciate that simulation methods are becoming more and more common and accepted in statistics. Consider jackknife and bootstrap methods for complete-data frequentist inference (e.g., Miller 1974; Efron and Tibshirani 1993), or data augmentation (Tanner and Wong 1987), the Gibbs sampler (e.g., Gelfand and Smith 1990; Gelman and Rubin 1992), and sampling importance resampling methods (Rubin 1983, 1987, 1988; Gelfand and Smith 1992) for complete-data Bayesian inference. These methods have now become accepted complete-data tools worthy of theoretical investigation and routine practical application.
Second, multiple imputation has a distinct advantage over such methods in principle, because with multiple imputation, the simulation is only being used to handle the missing information, with reliance for handling the rest of the information left to the complete-data method, be it analytic or simulation-based. Thus, the acceptable number of imputations can be much less than the acceptable number of simulations for a complete-data inference, at least assuming that the fraction of missing information, $\gamma$, is modest.
Finally, even when a particular multiple imputation method has deficiencies, it can only distort part of the inference in contrast to an incorrect complete-data analysis, which can distort the entire inference. For example, results in Heitjan and Rubin (1990) in a particular example suggest that doing some kind of multiple imputation, even if under a naive model, is far better inferentially than standard or sophisticated approaches with single imputation. In some vague sense, if a multiple imputation method is 20% deficient (80% okay) with 30% missing information, its total distortion is 20% of 30%, or 6%, implying that the repeated-imputation inference is 94% okay.

Is Multiple Imputation Too Much Work For The User?
My primary response to this question is: "Too much work relative to doing what?" Multiple imputation is intellectually trivial for the user. Running the identical complete-data software $m$ times (e.g., 3, 5, or 10 times) and combining the results "by hand" is admittedly a burden, but is computationally trivial given appropriate macros (which are easy to write, e.g., in S-Plus; see Schafer 1995, or SAS, Freedman 1990). I believe it is substantially easier for the user, even without appropriate macros, than any other method that can validly address nonresponse in any generality. As repeatedly emphasized by many workers in this area, methods such as "fill in the mean and ignore," "available cases," "treat the data set as a two-way additive model and singly impute with zero interaction," etc., are not statistically valid in any generality, even for point estimation of a variety of estimands, such as means, variances, correlations, and are therefore not appropriate for public-use data bases.

Does a Multiply-imputed Data Set Take Too Much Storage?
A multiply-imputed data set, in terms of needed storage locations, is [1 + $m$ × (proportion of values missing overall)] times as big as the original data set, typically a factor of two or less. For example, suppose the data set has 10,000 units; 20 background variables fully recorded; 20 "easy" survey questions, 5% missing; 30 "moderate" survey questions, 10% missing; 30 "difficult" survey questions, 30% missing: then the complete-data set = 1,000,000 items with 130,000 items missing. The associated multiply-imputed ($m = 5$) data set consists of the complete-data set of 870,000 data values and 130,000 pointers to the rows of the supplemental 130,000 × 5 matrix of imputations, for a total of 1,000,000 + 650,000 locations. Given the appropriate macros, we can unpack the multiple imputations to create five completed data sets only at the time of each of the five complete-data analyses, sequentially, in a manner transparent to the ultimate user, and using less than twice the storage needed for the original data set. Even with more missing values and more imputations per missing value, this issue should be easily handled with today's storage devices and simple and general macros, although it can be a burden without appropriate software. In situations with nonresponse confined to a few variables, an effective device can be to create a rectangular data set with $m$ versions of these variables but one version of the fully observed variables.
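The storage arithmetic in this example can be reproduced directly. The following is a small sketch; the function name `storage_summary` and the representation of variable blocks as (number of variables, percent missing) pairs are choices made here for illustration.

```python
def storage_summary(n_units, blocks, m):
    """Return (total items, missing items, imputation cells, storage factor).

    blocks is a list of (number_of_variables, percent_missing) pairs,
    with percent_missing given as an integer percentage.
    """
    total_items = n_units * sum(k for k, _ in blocks)
    missing_items = sum(n_units * k * pct // 100 for k, pct in blocks)
    imputed_cells = m * missing_items          # supplemental matrix of imputations
    factor = (total_items + imputed_cells) / total_items
    return total_items, missing_items, imputed_cells, factor


# The example from the text: 10,000 units, 100 variables in four blocks, m = 5.
blocks = [(20, 0), (20, 5), (30, 10), (30, 30)]
total_items, missing_items, imputed_cells, factor = storage_summary(
    10_000, blocks, m=5
)
print(total_items, missing_items, imputed_cells, factor)
```

This reproduces the 1,000,000 items, 130,000 missing items, and 650,000 imputation cells given in the text, for a storage factor of 1.65, which equals 1 + 5 × 0.13.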

Can Repeated Imputations Under An Appropriate Bayesian Model Lead to Invalid Inferences?
Fay (1991, 1992, 1993; also see Kott 1992) claims that even when the model used to create repeated imputations is "appropriate" in some sense, the resulting repeated inferences can be invalid. I believe that this criticism is misguided for a variety of reasons, many of which have been exposed in the work and discussion of Meng (1994). Nevertheless, I will also briefly address the issue here because it has received attention, and because I believe my results, although less extensive and detailed than those of Meng (1994), will be more transparent to many readers.
The kernel of this criticism arises when an irrelevant predictor $X$ of outcome $Y$ is not used by the Bayesian multiple imputer to create repeated imputations, but is used by the ultimate analyst to define estimands (a case already introduced here in Sec. 2.7 because of historical discussion of it). More specifically, suppose $X$ is dichotomous, $(a, b)$, and $Y$ is normal $(0, 1)$ and independent of $X$ in a population in which $X = a$ units and $X = b$ units are equally represented. Suppose a stratified random sample of size $2n$ is taken where there are $n$ units with $X = a$ and $n$ units with $X = b$, and further suppose that nonresponse is simply like another level of stratified random sampling that results in $n_1$ respondents and $n_0$ nonrespondents in both the $X = a$ sample and in the [...]
Now suppose repeated imputations for the $2n_0$ nonrespondents are generated using a fully exchangeable normal model based on the $2n_1$ respondents. That is, the imputations for both the $X = a$ and $X = b$ units will be centered at the observed grand mean $\bar{Y}_{obs}$ rather than at the separate observed sample means $\bar{Y}_{obs,a}$ and $\bar{Y}_{obs,b}$. It is easy to show that the multiple imputation method is proper for $(\bar{y}, U_{\bar{y}})$, but it is improper for $(d, U_d)$: (1) the expec-

Superefficient Imputations
In this example, the imputations are "superefficient" from the perspective of the data analyst interested in estimating D because the imputations use "extra" information, specifically the knowledge that the distribution of Y given X = a is identical to the distribution of Y given X = b. For a more familiar example of superefficiency, if the data are normal with mean zero, then half the sample mean is a superefficient estimate of the population mean. The situation involving superefficient imputations is more subtle, however. Suppose that we have a multiply-imputed data set, but subsequently the data collector brings forth values of the missing data, thereby allowing us to calculate $\hat{Q}$ and $U$. Presumably, we would then be inclined to base our inferences for $Q$ on $(\hat{Q}, U)$ and discard the imputations. If the imputations are superefficient, however, the standard complete-data procedure can be improved by using information in the imputations about $Q$ beyond that in $\hat{Q}$, information supplied by the imputer (e.g., in the canonical example, the knowledge that X = a units and X = b units have the same population distribution of Y). The imputations are "strongly superefficient" if $\bar{Q}_\infty$ is at least as good an estimate as $\hat{Q}$ despite the existence of missing data, that is, despite the fact that $\bar{Q}_\infty$ is not identical to $\hat{Q}$ in the formal sense that
$$\operatorname{var}(\bar{Q}_\infty - \hat{Q} \mid X, Y) > 0, \tag{3.1}$$
where with vector $Q$, ">" means at least one eigenvalue > 0.
More precisely, a multiple imputation procedure is strongly superefficient for the complete-data statistic $\hat{Q}$ if, first, $\bar{Q}_\infty$ and $\hat{Q}$ estimate the same estimand, that is, the procedure is "first-moment proper" for $\hat{Q}$,
The general definition of superefficiency concerns the existence of imputations that make $\bar{Q}_\infty$ informative about $Q$ even with knowledge of $\hat{Q}$. Bayesian models can be superefficient when they incorporate appropriate smoothing information in their distributional assumptions. The resultant draws of $Y_{mis}$ cannot be sharper than those from the parent distribution and still lead to valid inferences for a variety of estimands, but multiple imputations of $Y_{mis}$ can be more efficient than the one true value of $Y_{mis}$ because of their multiplicity. For instance, in the canonical example, suppose that the multiple imputation procedure drew the group difference effect from a normal distribution centered

Confidence-Proper Multiple Imputation
We are now ready to provide an extended definition of proper imputation and state an extended result concerning frequency validity. Although the conditions and conclusion are similar to the major conclusions of Meng (1994), they are more direct and not as extensive since they avoid the issues of the ultimate user's incomplete-data procedure and congeniality between the imputer's and analyst's models. The definition of "confidence-proper" multiple imputation is still in terms of the complete-data statistics $(\hat{Q}, U)$, but involves averaging over both the response mechanism and the sampling mechanism and allows overestimation of between-imputation variability.
A multiple imputation procedure is confidence-proper for the complete-data statistics $(\hat{Q}, U)$ if the imputations are "first-moment proper" for $(\hat{Q}, U)$ in the sense of (3.2) and (3.5),
In the canonical example, the strong superefficiency in the imputer's model for D implies that the data analyst's resultant repeated-imputation interval for D will have at least nominal coverage and hence will be confidence-valid; whether it is superior or inferior to other valid procedures depends on its interval length and the lengths of intervals from other confidence-valid procedures.

$$E(\bar{U}_\infty \mid X, Y) = E(U \mid X, Y)$$
The conclusion, however, is as before: try to impute using a Bayesian or approximate Bayesian model that tracks the data and the posited response mechanism; if you do this and your complete-data inference is confidence-valid, the result will be confidence-valid repeated-imputation inferences no matter how complex the survey design.

Confidence Validity Versus Randomization Validity in Canonical Example
Fay (1991, 1992) effectively claims that (a) wider 95% confidence intervals with exact 95% (asymptotic) coverage are superior to (b) narrower 95% confidence intervals with at least 95% coverage. Specifically, in the discussion of tables 1 and 2 of Fay (1992), summarized here in Table 2 after a bit of analysis to produce approximate coverage, the claim is made that the Rao and Shao (1992) (RS) procedure, using single-imputation hot deck, which results in uniformly wider intervals but with asymptotic coverage equal to the confidence coefficient, is inferentially superior to the multiple-imputation version of the same procedure (MI), which results in uniformly narrower intervals with asymptotic coverage at least as great as the confidence coefficient. Both procedures as reported are confidence valid, and I believe many statisticians and scientists would agree with Neyman's criteria and prefer sharper intervals with at least 95% coverage rather than wider intervals with exact 95% coverage.
Fay (1993) repeats the same criticism as Fay (1992) in more extreme examples (e.g., with up to 80% nonresponse) and labels the confidence coverage of the repeated-imputation inference as "punishingly conservative." But from the analyst's perspective, punishingly conservative relative to what alternative procedure? Presumably relative to what would have happened if the imputer had done what the analyst expected, that is, had used the analyst's model for imputation rather than be superefficient. But that would have led to wider intervals with exactly nominal coverage, which is a valid procedure, but less preferred, according to the Neyman definition and scientific criteria, than narrower intervals with greater coverage.
Of course, the confidence validity of the repeated-imputation inference does not mean it yields the best confidence-valid interval. By our mathematical analysis in this simple example we know that a shorter 95% confidence interval can be found with exact 95% coverage. Also, because the procedure is confidence valid but not randomization valid, inefficiencies can arise when combining various estimates using the assigned precisions as weights. But finding a randomization-valid procedure in general requires extra work beyond the use of standard complete-data methods, and is generally impossible for the ultimate user unless extra information is conveyed by the data-base constructor. Furthermore, this whole issue seems relatively unlikely to arise in practice because knowledge of population parameters by the data-base constructor must be unusual.

Reaching Correct Conclusions When Evaluating Multiple Imputation
Several points are critical in reaching correct conclusions concerning multiple imputation.
First, when evaluating repeated-imputation inferences by analysis or simulation, we need to monitor whether the complete-data inference with no missing data is valid: multiple imputation for missing data cannot fix problems with complete-data analyses (e.g., poor coverage properties of the normal approximation for the sample mean with rare binomial trials, where, for example, logit transforms can lead to more accurate complete-data inferences); Rubin and Schenker (1986) and Ezzati-Rice et al. (1995) provide examples of such evaluations. Also note when evaluating these procedures with the number of respondents fixed (e.g., as in sec. 4.3 and prob. 4-18 in Rubin 1987) that the resultant answers are conditional on these quantities, which in practice are random. Moreover, when doing evaluations treating the number of respondents as random, the theoretical variances of unbiased estimators can be undefined, since, for any finite sample size, with positive probability, all units will be nonrespondents; in such cases, it makes more sense to report coverage properties of interval estimates, which are defined (no respondents implies zero coverage) and the objects of statistical inference anyway.
Also important in reaching correct conclusions about multiple imputation is the treatment of estimated sampling variances as ancillary statistics rather than as estimates of scientific estimands. For example, Fay (1992) treated the ratio of repeated-sampling covariances as an estimand, and thereby was led to misunderstand the effect of superefficient imputation on inference. This illustrates why it is important not to confuse scientific estimands and ancillaries. In particular, Fay (1992, sec. 3) states, in the context of the canonical example of Section 3.5: "... the design-based approach gives 19 times the covariance of multiple imputation ... such a limitation, if general, imposes severe restrictions on the validity of the multiple imputation inferences for complex applications, such as Clogg et al. (1991)."
Consider the true sampling variance-covariance ellipsoid for $(\bar{y}_\infty, d_\infty)$ under the exchangeable normal repeated-imputation scheme and the sampling ellipsoid for $(\bar{y}_\infty, d_\infty)$ assigned to it by the repeated-imputation inference; both have zero correlation because $\bar{y}_{\infty,a} = (2\bar{y}_\infty + d_\infty)/2$ and $\bar{y}_{\infty,b} = (2\bar{y}_\infty - d_\infty)/2$ have equal variance. The repeated-imputation-assigned ellipsoid is outside because it touches the correct one at the two points along the $\bar{y}_\infty$ axis but is wider along the $d_\infty$ axis. Using Fay's ratio of sampling covariances of $\bar{y}_{\infty,a}$ and $\bar{y}_{\infty,b}$ is equivalent to describing the difference between these two ellipsoids by the ratio of differences of variances (i.e., of eigenvalues) of $2\bar{y}_\infty$ and $d_\infty$ in the two ellipsoids. The ratio of eigenvalues, or of variances in any direction, is relevant to inference, but the ratio of differences between eigenvalues, Fay's measure, is by itself irrelevant.
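The claimed equivalence between a covariance of the two group means and a difference of variances can be made explicit. The following is a sketch that assumes the infinite-m quantities are related as the complete-data estimators are, that is, $\bar{y}_\infty = (\bar{y}_{\infty,a} + \bar{y}_{\infty,b})/2$ and $d_\infty = \bar{y}_{\infty,a} - \bar{y}_{\infty,b}$; the cross terms cancel whatever the correlation between $\bar{y}_\infty$ and $d_\infty$.

```latex
\begin{align*}
\operatorname{cov}(\bar{y}_{\infty,a}, \bar{y}_{\infty,b})
  &= \operatorname{cov}\!\left(\bar{y}_\infty + \tfrac{1}{2} d_\infty,\;
                               \bar{y}_\infty - \tfrac{1}{2} d_\infty\right) \\
  &= \operatorname{var}(\bar{y}_\infty) - \tfrac{1}{4}\operatorname{var}(d_\infty) \\
  &= \tfrac{1}{4}\left[\operatorname{var}(2\bar{y}_\infty) - \operatorname{var}(d_\infty)\right].
\end{align*}
```

A ratio of two such covariances is therefore a ratio of differences of variances, which can be arbitrarily large even when each variance changes only modestly.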

COMPETING METHODS
If multiple imputations are proper (confidence proper) under the posited model for nonresponse, then using the repeated-imputation rules for combining complete-data statistics $(\hat{Q}, U)$ yields a randomization-valid (confidence-valid) final inference under the posited response mechanism, assuming that the complete-data inference was valid in the absence of nonresponse. And this holds no matter how complex the survey design. Moreover, the combining rules can be implemented using completely general software that is the same for all data sets and all complete-data analyses. Thus multiple imputation and the repeated-imputation combining rules satisfy both the basic objective and the supplemental achievable objective.
Are there competing methods that, in some cases at least, also satisfy these objectives? Yes, but such competitors appear to me in general to have substantially greater deficiencies for the intended situation in which ultimate users are distinct entities from database constructors. These competitors are single imputation, multiple imputation with some analysis for the ultimate user other than the repeated-imputation inference, and weighting methods.
For this to hold for each $\hat{Q}$ in such a range, the imputation method, single or multiple, must in general not only track the posited response mechanism but also must be a random draw method; otherwise, it cannot be first-moment proper for $\hat{Q} = \bar{y}$, $\hat{Q} = s^2$, $\hat{Q}$ = 25th percentile, etc. Consequently, any imputation method that satisfies the validity objective in generality must not only reflect the underlying response mechanism but must also be a random draw method. Nonrandom draw methods can be applied in special cases but require special analysis techniques. The most careful work on this topic of deterministic imputation of which I am aware concerns imputing probabilities for missing dichotomous variables (Schenker 1989; Schafer and Schenker 1991), and this work reveals the substantial extra effort that is needed, even in a special situation.
When an imputation method is a random-draw method, then multiple draws will automatically provide the basis for improved efficiency of estimation and more accurate inference, and are no more difficult to obtain than a single random-draw imputation. Thus multiple imputation is more attractive than single imputation, and the larger m the better, no matter how variances are to be calculated from the multiply-imputed data set. Little (1988) provides additional discussion of desiderata for creating imputations, which is consistent with this position.

Imputation in Random Independent Replicates: An Alternative
Suppose the sampling mechanism is such that the primary sampling units can be randomly divided into K replicate groups, each with the same sample design. Then with complete data, $\hat{Q}$ can be calculated in each replicate, and a valid (K - 1) df estimate of the variance of the average of the K independent estimates, $\bar{Q} = \sum \hat{Q}/K$, can be found from their sample variance divided by K. This can be called the "random group estimator" (Wolter 1985).
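As a minimal sketch of the random group estimator just described, the following function takes the K replicate estimates and returns their average, the estimated variance of that average, and its degrees of freedom. The function name and the example values are hypothetical.

```python
from statistics import mean, variance


def random_group_estimate(replicate_estimates):
    """Return (Q_bar, estimated variance of Q_bar, df) from K replicate estimates."""
    k = len(replicate_estimates)
    if k < 2:
        raise ValueError("need at least two replicates to estimate variance")
    q_bar = mean(replicate_estimates)
    # statistics.variance uses the (K - 1) divisor; dividing by K gives
    # the estimated variance of the average of K independent estimates.
    var_q_bar = variance(replicate_estimates) / k
    return q_bar, var_q_bar, k - 1


# Hypothetical estimates from K = 4 independent replicates.
q_bar, var_q_bar, df = random_group_estimate([10.0, 12.0, 11.0, 13.0])
print(q_bar, var_q_bar, df)
```

With only K - 1 degrees of freedom, the variance estimate is unstable when K is small, whatever the size of each replicate.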
This approach has been used with single imputation for missing data; I believe the method is appropriately attributed to Morris Hansen, but I cannot find the appropriate early reference (a relatively recent reference is Kalton 1983, pp. 112-123). Random-draw imputations are made in the K independent random replicates of the survey units, so that the variance of K values of $\hat{Q}$ on the imputed data is a K - 1 df estimated variance of $\bar{Q}$ (or $\hat{Q}$ calculated on the full imputed data); this estimate reflects not only sampling variability but also increased variance due to imputation. In personal communications, Hansen realized the propriety of the use of multiple imputations within each independent replicate to reduce variance due to imputation, and realized the potential tremendous loss of efficiency by doing the imputations independently in each independent replicate. In Rubin (1990), when discussing a related approach with energy data (Burns 1990), I called the resulting estimate of uncertainty an estimate of "evaluation variance" in contrast to "inferential variance" because it evaluates the variability of the estimation procedure, perhaps including excessive variability due to an efficient procedure used to handle missing data.
Assuming the requisite richness of survey data to allow the independent replicate procedure to be applied and assuming that the imputation method is first-moment proper, Hansen's method almost satisfies the basic and validity objectives, without needing the second-moment conditions involved in proper or confidence-proper imputations; I say "almost" because the ultimate user must be willing to forgo variance estimation aspects of the complete-data analysis programs, and rely on the potentially far less efficient variance estimation via the replicates, which does not fully satisfy the basic objective. Nevertheless, the lack of need for second-moment conditions for valid variance estimation is a potential advantage relative to relying on the repeated-imputation inference. Some experience suggests, however, that these potential benefits often cannot be realized because two kinds of inefficiencies arise. First, because imputations are required to be independent within each of K replicates, there is 1/Kth the amount of data used for modeling imputations as actually available. Second, small K implies very poor variance estimation, and often the largest possible K is truly 1, so that actual independent replicates cannot be used when trying to apply the method.
I believe that Hansen agreed that the independent replicate approach was generally inadequate and that the Bayesian multiple imputation approach is necessary to handle missing data in surveys:
Olkin: Have you become involved with Bayesian statistics or other techniques developed within the last ten years?
Hansen: Not really. I guess I endorse and approve the kind of thinking that Don Rubin has been doing.
Olkin: With respect to missing observations?
Hansen: Yes, in missing observations. Sometimes it's necessary to do modeling in sample surveys, where probability sampling methods aren't applicable as in the case of the imputation for nonresponse. We certainly have been involved in such methods. In general, I can't say that we have been working in that area very much. However we are interested in the potential in that setting.

Imputations in Hypothetical Independent Replicates: Another Alternative
One way to try to get around these inefficiencies is to try to do first-moment proper (multiple) imputation in K nonindependent samples, i.e., jackknife or bootstrap replicates (e.g., Burns 1990; Efron 1994). This is an interesting and useful idea, but it has limitations in our context. If the database constructor is to provide the imputations for the ultimate user, there must be a set of imputations for each of the K jackknife or bootstrap samples chosen by the data-base constructor, where K should be substantial for stable variance estimation (e.g., 100 or more). Moreover, if K = 100 replicate data sets are considered too many to provide, then the data-base constructor must include with the data base the software to be applied by the ultimate user to create the imputations on each of the ultimate user's jackknife or bootstrapped samples; in this case, superior imputations based on confidential or detailed information must be forgone. Also, as with independent replication, the basic objective is not fully satisfied for point or variance estimation, and more work is required of the ultimate user than with a multiply-imputed data set. Moreover, the variance estimation can be inaccurate inferentially, reflecting excessive procedural variance (see, e.g., Rao and Shao 1992, p. 813, and Burns 1990; incidentally, Burns subsequently found that multiple imputation worked well relative to replicate imputation, Burns 1991, 1993).
If neither the data-base constructor's bootstrap/jackknife imputations nor the data-base constructor's imputation software is delivered to the ultimate user, this approach effectively throws the entire problem into the ultimate user's lap, who may well do some sort of misguided imputation, which is not even first-moment proper, take bootstrap or jackknife replicates and assume inferential validity despite badly biased estimates of scientific estimands (see, e.g., Rubin 1994, and Efron 1994, for differing views concerning the acceptability of such answers).

Other Imputation-Based Procedures
Rao and Shao (1992) provide a careful analysis of how to use the jackknife to adjust analyses when missing data have been singly imputed by a particular hot-deck procedure. This addresses an important problem because in current practice many public-use files have been singly imputed by the hot deck. But the ultimate user bears the burden of substantial extra work, because "special computations have to be performed to adjust the imputed values for each pseudoreplicate before applying the standard jackknife variance formula" (Rao and Shao 1992, p. 813), and new mathematical analysis and new software apparently must be developed for each new distinct situation (estimator × missing data pattern × survey design × imputation method). Consequently, this approach, at least at present, fails to satisfy the basic objective of relying only on complete-data analyses and general routines.
Fay's work is something of a moving target, with a variety of older and newer suggestions, which are described with little generality and under special assumptions (e.g., missing completely at random). For example, Fay (1996) seems now to accept multiple imputation as being superior to single imputation (and perhaps to standard weighting adjustments) but advocates creating "improper" multiple imputations and recommends analysis by weighting the data from the completed units in one analysis rather than using the repeated-imputation inference. Recommending creating "improper" multiple imputations is suggesting what we should not do, but it is not a prescription for doing anything in particular. Presumably, it refers to first-moment proper multiple imputation (because without this even point estimation can be badly biased) but without concern for the second-moment conditions (e.g., fixing parameters at point estimates rather than drawing them from their posterior distributions, as in Rubin 1987, ex. 4.1, prob. 13 in chap. 1, and prob. 46 in chap. 5). But this is not even defined in multistage complex surveys with clusters where valid imputation models need to be hierarchical, typically with levels of parametric structure: I know what it means to try to be proper in complex surveys by following a Bayesian analysis with variables for the survey structure included in the modelling, but I do not know what the advice to "not do this" means. Also consider the example in Rubin (1983, sec. 2.8, also described in Gelman, Carlin, Stern, and Rubin 1995, chap. 15), which stimulated the methods in Clogg et al. (1991) and illustrates the need to be Bayesian and include variability in parameter estimation in order to obtain valid frequency inference.
Finally, consider the suggestion in Fay (1996) that the analysis of a multiply-imputed data set should proceed by replacing each incomplete unit with multiply-imputed versions of that unit's data with split weights. I considered and discarded this idea in Rubin (1977b; also see Rubin 1987, prob. 4-29, and the rejoinder in Meng 1994), because it seemed to have merit as a method of analysis only in simple cases (see, e.g., Little 1979). For valid analysis in general, I believe that such an approach requires extra routines for different complete-data analyses, and so fails to satisfy the basic objective. As a method for storing the multiply-imputed data sets, it can take substantially more memory than the standard form because all the observed data for units with some missing data are stored many times instead of just once. Nevertheless, I would certainly be interested in seeing any work that suggests I rejected this idea prematurely, and that in fact, it can be made to work for any posited response mechanism, complex survey, and complete-data analysis, with only the addition of completely general macros.

Conclusions Regarding Alternative Imputation Strategies
Given a situation with a single imputation method that is first-moment proper for many statistics, it is almost certainly a random-draw method, and then multiple imputations are easily created, and these are the basis of more accurate inference. Then the only reason not to create them and recommend to the ultimate user that the multiply-imputed data be analyzed using repeated-imputation combining rules is fear that the imputation method, although first-moment proper, is not fully proper for some analyses. If it is not proper but is confidence proper, the only legitimate fear is lost power and overcoverage, as due to superefficiency. But then another method is needed for the ultimate user to recover such superefficiency, and I believe this means special methods for different situations. Are such special efforts needed? All realistic examples I know suggest that in practice the overcoverage is slight and a minor issue relative to omitted variables that can lead all methods astray because of biased estimation and undercoverage. General theory and examples suggest that second-moment properness of Bayesianly-motivated multiple imputation procedures typically follows automatically if the method is first-moment proper (see, e.g., Huber 1976, and results referenced in Rubin 1987, sec. 2.10). Nevertheless, more work on this issue is desirable and could make general theoretical contributions to understanding the robustness of Bayesian inference.
My conclusion when doing imputation is to do multiple imputation under carefully chosen models and use the repeated-imputation inference for analysis. Of course, more theoretical development is still desirable on such issues as: implicit imputation models that reflect both the uncertainty of parameter estimation and the uncertainty of the values to impute given a specific predictive fit (van Buuren, van Rijckevorsel, and Rubin 1993); models for sequential imputation (Kong, Liu, and Wong 1994; Liu and Chen 1995); the use of importance weights (Meng 1994); improved small-m combining rules in especially difficult cases (Barnard 1995); and the development of realistic nonignorable models for particular settings.
, and editorial reviewers. Also, thanks are due to R. E. Fay for his continuing interest in multiple imputation and for his special examples, which helped stimulate the formulation here of superefficient multiple imputation and the associated new results. Finally, David Binder's comments on presentations of this material are gratefully acknowledged.
ence (in the traditional complex survey sense of Neyman, Cochran, and Hansen) in the difficult real-world situation where (1) ultimate users and data-base constructors are distinct entities with different analyses, models, and capabilities, and (2) there typically is no one accepted reason for the missing data.

, "randomization validity" means that, for interval estimates, "actual interval coverage = nominal interval coverage," and for tests of hypotheses, "actual rejection rate = nominal rejection rate." Randomization validity is the natural objective in most survey contexts. In standard asymptotic situations, a complete-data estimate $\hat{Q}$ of an estimand $Q$ has a normal sampling distribution centered at $Q$ with sampling variance (or more generally, variance-covariance) consistently estimated by the statistic $U$, where the randomization distribution is that generated by the sampling indicator $I$ given fixed $(X, Y)$, that is, the sampling mechanism.
In this case we have draws from the posterior predictive distribution of the missing values under a specific model, that is, a particular Bayesian model for both the data and the missing-data mechanism. The m complete-data analyses corresponding to the m imputations under one model result in m repeated completed-data statistics, and these are combined to form one repeated-imputation inference that appropriately adjusts for nonresponse under the model used to create the repeated imputations. The values of the complete-data statistics $\hat{Q}$ and $U$ calculated on the m completed data sets are $\hat{Q}_{*1}, \ldots, \hat{Q}_{*m}$ and $U_{*1}, \ldots, U_{*m}$. The basic procedures for combining the m estimates $\{\hat{Q}_{*1}, \ldots, \hat{Q}_{*m}\}$, associated variance-covariance matrices $\{U_{*1}, \ldots, U_{*m}\}$, and p values, that is, the final repeated-imputation inferences, are derived in chapter 3 in Rubin (1987) under the Bayesian paradigm for survey inference (introduced in chap. 2 of Rubin 1987), assuming that the multiple imputations are repeated imputations.
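For a scalar estimand, the combining step can be sketched in a few lines. The formulas used are the standard repeated-imputation rules from Rubin (1987), which are not displayed in the text above; the function name, argument names, and example values are illustrative assumptions.

```python
import math
from statistics import mean, variance


def combine_repeated_imputations(q_hats, u_hats):
    """Combine m completed-data estimates and variances (scalar case).

    Returns (q_bar, total_variance, df) using the standard rules:
    T = U_bar + (1 + 1/m) * B, with a t reference distribution on df.
    """
    m = len(q_hats)
    if m < 2 or m != len(u_hats):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(q_hats)      # repeated-imputation point estimate
    u_bar = mean(u_hats)      # average within-imputation variance
    b = variance(q_hats)      # between-imputation variance, divisor m - 1
    total_variance = u_bar + (1 + 1 / m) * b
    if b == 0:
        df = math.inf         # no between-imputation variability
    else:
        r = (1 + 1 / m) * b / u_bar  # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2
    return q_bar, total_variance, df


# Hypothetical results from m = 3 completed-data analyses.
q_bar, total_variance, df = combine_repeated_imputations(
    [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
)
print(q_bar, total_variance, df)
```

The same function applies to any complete-data analysis that reports an estimate and its variance, which is the sense in which the combining software is general.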
fully observed sampling indicators for which values of Y were included in the survey for observation, and R is the array of fully observed indicators for response (i.e., for which components of Y that were intended to be observed were observed). A component, $Y_{ij}$, is observed if both associated indicators, $I_{ij}$ and $R_{ij}$, are one, and is not observed if either is zero. This perspective is called the random-response randomization-based perspective.

2.4 Proper Multiple Imputation
A key concept underlying these randomization-based evaluations is that of proper multiple imputation, whose mathematical definition is purely frequentist, since it involves expectations given that the population values (X, Y) are fixed. The crucial result is that when (1) the multiple imputations are proper for $(\hat{Q}, U)$, and (2) the complete-data inference based on $(\hat{Q}, U)$ is randomization-valid for $Q$, then the large-m repeated-imputation inference given by (2.5) is randomization-valid for the scientific estimand $Q$, no matter how complex the survey design. Whether a multiple imputation procedure is proper depends, in general, on which complete-data estimates, $\hat{Q}$, and associated variance-covariance matrices, $U$, are being considered. The full definition is given in Rubin (1987, pp. 118-119); it is summarized here ignoring the more technical conditions in order to focus attention on three essential conditions. The definition of a proper multiple imputation procedure treats (X, Y) and the intended sample (as indicated by I) as fixed [except for a minor technical condition, eq. (4.2.9) in Rubin 1987], and deals with the fixed but unknown values of the complete-data statistics $(\hat{Q}, U)$ in the sample as if they were estimands. That is, the randomization distribution critically involved in the definition of proper multiple imputation is generated by the response mechanism, in which X, Y, and I are fixed, and R is the random variable. Because the conditions for proper imputation involve large m, the simplified definition only involves expectations with respect to the response mechanism. For proper imputation, the values of the complete-data statistics $\hat{Q}$ and $U$ created by filling in the missing Y values, that is, $\hat{Q}_{*l}$ and $U_{*l}$, must be approximately unbiased for their complete-data analogs $\hat{Q}$ and $U$; that is, in terms of their infinite-m averages $\bar{Q}_\infty$ and $\bar{U}_\infty$, conditions (2.6) and (2.7) must hold. In addition, $B_\infty$, which is the variance-covariance of the $\hat{Q}_{*l}$ across the m imputations, must be approximately unbiased for the randomization variance of $\bar{Q}_\infty$, which is condition (2.8).
Equation (2.6) for proper imputation is analogous to (1.1) for randomization validity: both require approximate unbiasedness of the estimate ($\bar{Q}_\infty$ or $\hat{Q}$) for its estimand ($\hat{Q}$ or $Q$) over its randomization distribution (induced by the response mechanism or the sampling mechanism). Equation (2.8) for proper imputation is analogous to (1.2) for randomization validity: both require approximately unbiased estimation by the ancillary statistic ($B_\infty$ or $U$) for the variance of the estimate ($\bar{Q}_\infty$ or $\hat{Q}$) over its randomization distribution (induced by the response or sampling mechanism). Also, just as (1.1) and (1.2) together imply (at least in large-sample surveys) that randomization-valid inferences for $Q$ can be based on the approximation $(\hat{Q} \mid X, Y) \sim N(Q, U)$, (2.6) and (2.8) together imply that randomization-valid inferences for the complete-data statistics $\hat{Q}$ can be based on the approximation $(\bar{Q}_\infty \mid X, Y, I) \sim N(\hat{Q}, B_\infty)$, where the randomization distributions are induced by the sampling and response mechanisms, respectively. The remaining condition for proper imputation has no direct analog in complete-data randomization validity: expression (2.7) implies that the complete-data ancillary statistic $U$, being treated as an ancillary complete-data estimand for the definition of proper imputation, is approximately unbiasedly estimated after imputation.
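The three conditions can be written compactly. The displayed equations did not survive in this copy of the text, so the forms below are a reconstruction from the verbal description and should be checked against Rubin (1987, pp. 118-119); the use of $\approx$ for "approximately unbiased" is an assumption about notation.

```latex
\begin{align*}
E(\bar{Q}_\infty \mid X, Y, I) &\approx \hat{Q}, \tag{2.6}\\
E(\bar{U}_\infty \mid X, Y, I) &\approx U, \tag{2.7}\\
E(B_\infty \mid X, Y, I) &\approx \operatorname{var}(\bar{Q}_\infty \mid X, Y, I). \tag{2.8}
\end{align*}
```

In each case the expectation is over the response mechanism only, with X, Y, and I held fixed.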
Result 4.1: If the complete-data inference is randomization-valid and the multiple-imputation procedure is proper, then the infinite-m repeated-imputation inference is randomization-valid under the posited response mechanism (Rubin 1987, p. 119).
This result follows from combining the formal versions of
(chap. 4) presented analytic results, simulation evaluations, and many examples of proper and improper multiple imputation methods, where the evaluations were all from the random-response randomization-based frequentist perspective. The trick in many of the examples of proper imputation was to get the variance condition (2.8) correct, and it was shown that when drawing imputations to approximate repetitions from a sensible Bayesian model, conditions (2.6)-(2.8) typically followed automatically. The more straightforward conditions, (2.6) and (2.7), typically were simple properties of any intelligent imputation scheme that tried to track the data. An example of a method that does not track the data is "fill in the mean," which although it may satisfy (2.6) for $\hat{Q} = \bar{y}$, fails to do so for $\hat{Q} = s^2$ or for $\hat{Q}$ = 25th percentile, or to satisfy (2.7) for $U = s^2/n$, etc.
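A small numerical illustration of why "fill in the mean" does not track the data: the completed-data mean is preserved, but the completed-data variance, and hence the variance estimate of the mean, is deflated. The data values are hypothetical, and the comparison with the respondents' variance is only meaningful under the assumption that the missing values resemble the observed ones.

```python
from statistics import mean, variance

# Hypothetical respondent values and number of nonrespondents.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4

# "Fill in the mean": every missing value gets the respondent mean.
fill_value = mean(observed)
completed = observed + [fill_value] * n_missing

mean_obs = mean(observed)
mean_completed = mean(completed)    # unchanged by mean imputation
s2_obs = variance(observed)         # respondents' sample variance
s2_completed = variance(completed)  # shrunk: imputed values add no spread

# The variance estimate of the mean is deflated twice over:
# s2 is too small and n is overstated.
u_obs = s2_obs / len(observed)
u_completed = s2_completed / len(completed)

print(mean_obs, mean_completed)
print(s2_obs, s2_completed)
print(u_obs, u_completed)
```

Here the sum of squared deviations is the same before and after imputation, so the sample variance falls by the factor (4 - 1)/(8 - 1) = 3/7; a random-draw method would instead add spread in the imputed values.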
Hot-deck (Bootstrap) and random-draw regression methods tend to satisfy (2.6) and (2.7) but fail to satisfy (2.8) until a Bayesian, systematic between-imputation component of variability is added (e.g., via the Bayesian Bootstrap, Rubin 1981), to reflect uncertainty in the estimation of population parameters. The view in 1987, which I still hold today, was summarized as follows.
Conclusion 4.1: If imputations are drawn to approximate repetitions from a Bayesian posterior distribution of $Y_{mis}$ under the posited response mech-
sec. 7) explicitly consider the situation in the early industry and occupation example where some information used by the imputer (the original double-coded sample) is not available to the data analyst, and demonstrate the resulting potential conservative coverage. Also, the evaluations of the results of this project include cases where the data analyst uses variables not used by the imputer and, for this data set and practical analyses, find no deleterious consequences (Schenker, Treiman, and Weidman 1993; Treiman, Bielby, and Cheng 1989; Weld 1987). Careful and extensive evaluations of this general situation, involving variables omitted by the imputer, are also included in work conducted at ETS in the context of NAEP, which for a decade has created multiply-imputed public-use data bases (e.g., Mislevy, Johnson, and Muraki 1992).
Take Too Much Work to Create Proper or Approximately Proper Multiple Imputations?

Again, my response to this question is "too much relative to what?" It certainly takes much more work than some methods that have no general validity. But multiple imputation takes little more work than other methods that attempt to address nonresponse validly and with some generality. Moreover, essentially all the extra work is needed from the data-base constructor, who may have the resources to do the job well, rather than the world of ultimate users with their varied and limited resources. In fact, some experience suggests that in practice it may be substantially easier to do model-based multiple imputation than to use previous approaches because we can apply powerful methods of direct and indirect simulation under full probability models (e.g., data augmentation, the Gibbs sampler) and let the computer do much of the work previously done by expensive and exhausting human iteration; consider, for example, the recent project dealing with nonmonotone missing data patterns in NHANES (Fahimi and Judkins 1993; Schafer et al. 1993; Ezzati-Rice et al. 1993; Johnson et al. 1993; and Little and Rubin 1993). For other examples dealing with the creation of multiple imputations and related issues, consider Kennickell (1991); Chand and Alexander (1994); Paulin and Ferraro (1994); and Eltinge, Yansaneh, and Paulin (1994).
equals zero. The obvious complete-data estimators are $\bar{y} = (\bar{y}_a + \bar{y}_b)/2$ for $\bar{Y}$ and $d = \bar{y}_a - \bar{y}_b$ for $D$, with associated standard complete-data variance estimates $U_{\bar{y}}$ and $U_d$, respectively, which result in randomization-valid complete-data inferences, at least for large n.
,b) rather than at $\bar{y}_{\mathrm{obs},a} - \bar{y}_{\mathrm{obs},b}$ (as when this effect is directly estimated from the data) or at zero (as with the strongly superefficient imputations of sec. 3.5). These imputations would effectively be additional data values, which could contribute to a better estimate of D, even if the actual missing values were found. The general definition of superefficient imputations for Q replaces (3.3) with

$$\operatorname{cov}(\bar{Q}_\infty, Q \mid X, Y) < \operatorname{var}(Q \mid X, Y); \tag{3.4}$$

strong superefficiency implies superefficiency because (3.1) and (3.3) imply (3.4).
Validity. If the complete-data inference based on (Q, U) is confidence valid and the multiple imputation procedure is confidence proper for (Q, U), then the repeated-imputation inference is confidence valid with E

that the imputation method must be first-moment proper, in the sense of (3.2), for a variety of statistics Q, for example Q = sample mean, sample variance, median, 25th percentile, factor loadings, and these quantities within strata, domains, subdomains, etc.
weighting adjustments for nonresponse, which, in principle, can be a very effective class of methods for obtaining approximately unbiased estimates. Each unit's weight is the inverse probability of observing that unit's pattern of missing data given (X, Y) information. If the patterns of missing data for the units are created by design, as with matrix sampling, these probabilities and thus the weights are known. When these patterns of missing data are affected by nonresponse, the nonresponse probabilities need to be estimated. Although this estimation can be undertaken by the data-base constructor, typically it is only done assuming the simplest case of nonresponse where the units are either respondents (with all of Y observed) or nonrespondents (with all of Y missing); in this case, the nonrespondents can be discarded, and (approximately) unbiased estimates can be obtained from the respondents and their weights, assuming the weights accurately reflect both the sampling and nonresponse mechanisms.

Several issues arise with the use of weighting adjustments. First, even in the simplest case of unit nonresponse, where the shared data base of respondents is fully observed, many ultimate users' complete-data analyses do not allow for sampling weights. Second, even with complete-data analyses that can deal with sampling weights, the construction of intervals and p-values that validly account for the fact that nonresponse adjustments in the weights are estimated from data is not immediate from complete-data analyses. Third, with general patterns of nonresponse, special analysis methods need to be developed and special software needs to be written (see Little 1988, sec. 5.1, for the case of monotone missing data), but attempting to do this in a manner that allows the use of standard complete-data software leads to ad hoc approaches such as "complete cases" and "available cases," which we have already rejected as unacceptable general solutions. These three issues imply that in general, weighting adjustments do not satisfy the objectives of allowing ultimate users to apply standard complete-data software to shared data bases to obtain valid inference.

A fourth issue with such weighting adjustments is that they are focused on unbiased estimation and are essentially blind to efficiency concerns. In most well-designed surveys, the planned pattern of missing data is such that efficient estimates are expected to result from standard weighted estimates. But nonrespondents do not necessarily create missing data in such a benign way, and so standard weighted estimates, even when approximately unbiased, can have excessive variability. Consider dealing with censored data by weighting: data beyond or approaching the censoring point have zero or very small probabilities of being observed, and so either cannot be dealt with by weighting or imply a few observations with dominant weights. Weighting by inverse probabilities cannot create estimates outside the convex hull of the observed data, and estimates involving weights near the boundary have extremely large variance. For these reasons, weighting, although theoretically attractive in an asymptotic sense, has never really been claimed to be a complete practical solution to the problem of missing data in shared data bases; recall Hansen's (1987) comments reported in Section 4.2.
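The behavior of inverse-probability weighting described above can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration: two equally sized groups with known response probabilities 0.9 and 0.1, normally distributed outcomes, and the helper names. The effective sample size uses Kish's approximation, (sum of weights)^2 / (sum of squared weights), as a rough gauge of how much a few dominant weights inflate variability.

```python
import random
import statistics


def inverse_probability_weights(response_probs):
    """Each respondent's weight is the inverse of its probability of responding."""
    return [1.0 / p for p in response_probs]


def weighted_mean(values, weights):
    """Ratio-type weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def effective_sample_size(weights):
    """Kish's approximation: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


rng = random.Random(2)  # arbitrary seed
population = []
for _ in range(5000):
    population.append((rng.gauss(10.0, 1.0), 0.9))  # group that usually responds
    population.append((rng.gauss(20.0, 1.0), 0.1))  # group that rarely responds

# Unit nonresponse: a unit is observed with its (here assumed known) probability.
respondents = [(y, p) for y, p in population if rng.random() < p]
y_obs = [y for y, _ in respondents]
weights = inverse_probability_weights([p for _, p in respondents])

full_mean = statistics.fmean(y for y, _ in population)
print(f"full-population mean  : {full_mean:.3f}")
print(f"unweighted respondents: {statistics.fmean(y_obs):.3f}")
print(f"inverse-prob. weighted: {weighted_mean(y_obs, weights):.3f}")
print(f"respondents: {len(y_obs)}, effective sample size: {effective_sample_size(weights):.0f}")
```

The weighted mean removes most of the bias of the unweighted respondent mean, but it remains a convex combination of observed values, and the effective sample size falls well below the respondent count because the rarely responding group carries weights of 10.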