ISSN : 1229-067X
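It is well known that appropriate handling of missing data in statistical modeling can decisively affect the final statistical conclusions. A variety of missing data techniques have therefore been proposed, and among those under active study recently are full information maximum likelihood (FIML), multiple imputation (MI), and Bayesian approaches. FIML is the most widely used of these in social science statistics and is already familiar to many psychology researchers, whereas multiple imputation and Bayesian approaches remain unfamiliar to many of them. This article therefore introduces the three methods in the context of covariance structure modeling and compares their main features, with the aim of helping psychology researchers understand missing data handling. Unlike other statistical models, applying a covariance structure model involves computing indices of fit between the data and the model and judging the adequacy of the model on that basis. Under maximum likelihood, a variety of fit indices derived from the chi-square statistic have been proposed; under multiple imputation, the D2 and D3 statistics are used; and under the Bayesian approach, posterior predictive model checking (PPMC) is used. The maximum likelihood chi-square statistic has been studied independently in many contexts, but the methods proposed for multiple imputation and the Bayesian approach have neither been evaluated in the context of covariance structure modeling nor compared with one another. This article therefore compares how the three missing data methods affect judgments of data-model fit for the same structural equation model. Specifically, using simulation, we assumed a population in which the data do not support longitudinal measurement invariance (longitudinal measurement non-invariance), and estimated and compared, for maximum likelihood, multiple imputation, and the Bayesian approach, the Type I error rate when the correct model is used and the power when a model assuming partial measurement invariance is applied. The results are expected to help researchers who analyze data with missing values using covariance structure models understand the characteristics of missing data techniques, and to offer suitable guidance for judging the fit of the final model.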
In practical applications of statistical modeling, including structural equation modeling (SEM), virtually every data set contains missing values. It is well known that improper handling of missing data can harm subsequent statistical inferences in a variety of ways and to varying degrees. In the context of SEM, full information maximum likelihood (FIML) has arguably been the most popular method for addressing missing data. Multiple imputation (MI) procedures and Bayesian approaches, although still less widely known to most applied researchers, have recently begun to emerge as flexible and viable alternatives to FIML. An important objective of this article is to introduce these methods to applied researchers in an accessible manner, using SEM as the context. Structural equation modeling involves proposing, estimating, and evaluating the researcher's hypothesis about the process believed to have generated the observed data. It is therefore essential to evaluate the overall goodness of fit of the posited model in any given application. FIML, MI, and Bayesian approaches yield, respectively, the chi-square statistic, the D2 and D3 statistics, and the posterior predictive model checking (PPMC) p-value as statistical tools for assessing data-model fit. Another important objective of this article is to study the performance of these model evaluation tools in the context of SEM, and to evaluate their relative performance with respect to Type I error rates and power. With the exception of the chi-square statistic, the performance of these tools has never been evaluated or compared within the context of SEM.
The initial results provided in the present article are believed not only to enhance the knowledge base regarding the characteristics of these assessment tools under missing data, but also to provide an initial guideline for their proper use in real-world data analysis, especially in applications of SEM with missing data.
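For readers unfamiliar with how a fit statistic is pooled across imputed data sets, the following is a minimal sketch of the commonly used D2 pooling rule, which combines the chi-square statistics obtained from each of the m imputed data sets. It is an illustration under stated assumptions, not the procedure used in this article: the function name `pool_d2` and the example values are hypothetical, and the resulting statistic is usually referred to an F distribution with the two returned degrees of freedom to obtain a p-value.

```python
import math
from statistics import mean


def pool_d2(chisq, df):
    """Pool chi-square statistics from m imputed data sets (D2 rule).

    chisq : list of chi-square statistics, one per imputed data set
    df    : degrees of freedom shared by every chi-square statistic
    Returns (D2, numerator df, denominator df) for an F reference distribution.
    """
    m = len(chisq)
    if m < 2:
        raise ValueError("D2 needs at least two imputed data sets")

    # Between-imputation variability is measured on the square-root scale.
    roots = [math.sqrt(w) for w in chisq]
    root_mean = mean(roots)
    var_roots = sum((r - root_mean) ** 2 for r in roots) / (m - 1)

    # Relative increase in variance due to missing data.
    r2 = (1 + 1 / m) * var_roots

    # Pooled statistic: average chi-square per df, penalized by r2.
    d2 = (mean(chisq) / df - (m + 1) / (m - 1) * r2) / (1 + r2)

    # Denominator df; infinite when the imputations agree exactly.
    if r2 == 0:
        df2 = math.inf
    else:
        df2 = df ** (-3 / m) * (m - 1) * (1 + 1 / r2) ** 2
    return d2, df, df2


if __name__ == "__main__":
    # Hypothetical chi-square values from five imputed data sets, df = 24.
    print(pool_d2([31.2, 28.7, 35.1, 30.4, 29.9], df=24))
```

D2 needs only the m chi-square values, which makes it easy to compute from standard SEM output; D3 instead pools likelihood ratio tests and requires re-evaluating the likelihood at the pooled parameter estimates, so it is not sketched here.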