Articles | Open Access | https://doi.org/10.55640/

Integrating Survey Data through Matched Mass Imputation: A Comprehensive Approach

Dr. Sofia I. Petrov, PhD, Department of Applied Mathematics and Statistics, Moscow State University, Russia

Abstract

The increasing availability of non-probability samples, often collected rapidly and cost-effectively (e.g., through web surveys), presents both opportunities and challenges for statistical inference. While probability samples remain the gold standard for unbiased estimation, their cost and declining response rates necessitate innovative methods for integrating data from diverse sources. This article explores Matched Mass Imputation (MMI) as a robust and efficient approach for combining information from a traditional probability sample with a larger, auxiliary non-probability sample. We detail the methodological framework of MMI, which leverages matching techniques to identify suitable donors from the non-probability sample for recipients in the probability sample, followed by mass imputation of unobserved variables. This approach aims to mitigate biases inherent in non-probability samples and enhance the precision of estimates by effectively utilizing the larger sample size. We discuss the theoretical underpinnings, practical implementation considerations, and the conditions under which MMI can yield reliable inferences, including the crucial common support assumption and the role of statistical learning methods. By synthesizing recent advancements, this paper demonstrates MMI's potential to provide a powerful and flexible solution for modern survey data integration, balancing the need for accuracy with the realities of data collection in an evolving landscape.
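The two steps described in the abstract (match each probability-sample recipient to donors in the non-probability sample, then mass-impute the unobserved variable) can be illustrated with a minimal sketch. It is not the article's implementation: the function name `matched_mass_imputation`, the choice of Euclidean nearest-neighbor matching on standardized covariates, the number of donors `k`, and the final design-weighted (Hájek-type) mean are all assumptions made here for illustration.

```python
import numpy as np


def matched_mass_imputation(x_prob, weights, x_nonprob, y_nonprob, k=1):
    """Hypothetical sketch: impute y for each probability-sample unit from its
    k nearest non-probability donors, then form a design-weighted mean."""
    x_prob = np.asarray(x_prob, dtype=float)        # (n_prob, p) covariates, y unobserved
    x_nonprob = np.asarray(x_nonprob, dtype=float)  # (n_nonprob, p) covariates, y observed
    y_nonprob = np.asarray(y_nonprob, dtype=float)  # (n_nonprob,) study variable
    weights = np.asarray(weights, dtype=float)      # (n_prob,) design weights

    # Standardize both samples on the donor pool's scale so that no single
    # covariate dominates the distance (an assumption of this sketch).
    mu = x_nonprob.mean(axis=0)
    sd = x_nonprob.std(axis=0)
    sd[sd == 0] = 1.0
    z_prob = (x_prob - mu) / sd
    z_nonprob = (x_nonprob - mu) / sd

    # Squared Euclidean distance between every recipient and every donor.
    diff = z_prob[:, None, :] - z_nonprob[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)                 # (n_prob, n_nonprob)

    # Matching step: indices of the k closest donors for each recipient.
    donors = np.argsort(dist2, axis=1)[:, :k]

    # Mass imputation step: average the donors' observed y values.
    y_imputed = y_nonprob[donors].mean(axis=1)

    # Design-weighted (Hajek-type) mean over the probability sample.
    estimate = np.sum(weights * y_imputed) / np.sum(weights)
    return estimate, y_imputed


if __name__ == "__main__":
    # Toy data, invented purely to exercise the function.
    x_np = [[0.0], [1.0], [2.0], [3.0]]
    y_np = [10.0, 20.0, 30.0, 40.0]
    x_p = [[0.1], [2.9]]
    w_p = [1.0, 3.0]
    est, y_imp = matched_mass_imputation(x_p, w_p, x_np, y_np, k=1)
    print(est, y_imp)
```

In this sketch the probability sample contributes only covariates and design weights, while every outcome value comes from the non-probability donors. That is why the common support assumption mentioned in the abstract matters: a recipient whose covariates lie far from all donors still receives an imputed value, but it may be a poor one.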

Keywords

Survey data integration, matched mass imputation, data harmonization

How to Cite

Integrating Survey Data through Matched Mass Imputation: A Comprehensive Approach. (2025). International Journal of Mathematics and Statistics, 5(01), 9-14. https://doi.org/10.55640/