Handling missing values is one of the first problems to address when a data set is incomplete. This article describes the types of missing values and how to deal with each of them.
Generally, there are three types of missing values: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). Each situation calls for a different way of handling the missing data. In one study, the authors compare five imputation methods (random forest (RF), k-nearest neighbors (kNN), singular value decomposition (SVD), Mean, and Median) for MCAR/MAR and six imputation methods (QRILC, Half-minimum, Zero, RF, kNN, and SVD) for MNAR. (Since MCAR and MAR are similar, they treat them as a single category.) Their results showed that RF imputation performed best for MCAR/MAR and that QRILC was the favored method for MNAR. This article therefore discusses the three types of missing values and looks at how the different imputation methods were evaluated.
Next, let’s look at how the three types of missing values differ.
Differences between MCAR, MAR, and MNAR
Missing Completely at Random (MCAR): the propensity for a data point to be missing is completely random. There is no relationship between whether a data point is missing and any values in the data set, missing or observed. The missing data are just a random subset of the data.
Missing at Random (MAR): the propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data. For example, in a survey, poor subjects might be less likely than wealthy subjects to answer a question about drug use. It can help to think of MAR as Missing Conditionally at Random, because the missingness is conditional on another variable. Since MCAR and MAR are very similar and hard to tell apart in practice, we analyze them together. Both are considered “ignorable” because we do not have to include any information about the missing-data mechanism itself when we deal with the missing data.
Missing Not at Random (MNAR): the propensity for a value to be missing is related to the value itself. MNAR is called “non-ignorable” because the missing-data mechanism has to be modeled when you deal with the missing data. You have to include some model of why the data are missing and what the likely values are.
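To make the three definitions concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the toy variables `x` and `y`, the 30% missing rate, and the logistic form of the missingness probabilities are not taken from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: x is fully observed, y depends on x.
x = rng.normal(size=n)
y = x + rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every y has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: the chance that y is missing depends only on the observed x.
miss_mar = rng.random(n) < sigmoid(2.0 * x - 1.0)
# MNAR: the chance that y is missing depends on y itself (low values vanish).
miss_mnar = rng.random(n) < sigmoid(-2.0 * y - 1.0)

def observed_mean(values, missing):
    """Mean of the values that are still observed."""
    return float(values[~missing].mean())

print(f"complete data: mean of y = {y.mean():.3f}")
for name, mask in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    print(f"{name}: {mask.mean():.0%} missing, "
          f"observed mean of y = {observed_mean(y, mask):.3f}")
```

In this toy setup the observed mean stays close to the complete-data mean only under MCAR. Under MAR the shift can in principle be corrected by conditioning on the observed `x`, whereas under MNAR it cannot be corrected without a model of the missingness itself.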
Figure 1: Figure from https://www.sciencedirect.com/science/article/pii/S0169534708002772#fig1
Figure 1. Illustrations of the classification for the mechanism of missing data. Blue points are observations whereas red points are missing observations in the y-variable; statistics for complete data (blue and red combined) are slope (b) = 1, standard error (se) = 0.05 and R² = 0.5. Assuming observations in the x-variable are complete, (a) represents missing at random (MAR), (b) represents missing not at random (MNAR) and (c) represents missing completely at random (MCAR). For the observed data (blue points), the estimated slope, se and R² are (a) b = 0.86, se = 0.11, R² = 0.29, (b) b = 0.432, se = 0.06, R² = 0.23 and (c) b = 0.957, se = 0.07, R² = 0.49.
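The pattern in Figure 1 can be reproduced qualitatively with a small simulation. This is only a sketch: the data-generating model (true slope 1, noise chosen so that R² is about 0.5), the missing rates, and the threshold rule used for MNAR are assumptions, and the numbers it prints will not match the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed toy model: true slope of 1 and R^2 of about 0.5.
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Boolean masks marking which y values go missing under each mechanism.
masks = {
    "MCAR": rng.random(n) < 0.4,                       # unrelated to x or y
    "MAR": rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x)),  # driven by observed x
    "MNAR": y < np.quantile(y, 0.4),                   # driven by y itself
}

def fit_slope(x_obs, y_obs):
    """Ordinary least-squares slope of y on x."""
    slope, _intercept = np.polyfit(x_obs, y_obs, deg=1)
    return float(slope)

slopes = {"complete": fit_slope(x, y)}
for name, missing in masks.items():
    slopes[name] = fit_slope(x[~missing], y[~missing])

for name, slope in slopes.items():
    print(f"{name:>8}: slope = {slope:.3f}")
```

With these assumptions the MCAR and MAR slopes stay near the complete-data slope, while the MNAR slope is clearly attenuated, which is the same direction of bias that panel (b) of the figure shows.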
Evaluation of imputation
Figure 2: Figure from https://www.nature.com/articles/s41598-017-19120-0/figures/1#search-menu
Figure 2. Evaluation of different imputation methods for MCAR/MAR. (a,b) NRMSE on unlabeled and labeled metabolomics data. (c,d) PCA-Procrustes sum of squared errors on unlabeled and labeled metabolomics data. (e) Pearson correlation of log p-values (t-test) of complete data and imputed data. (f) PLS-Procrustes sum of squared errors.
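As a rough illustration of how an NRMSE-style comparison works, the sketch below masks values completely at random, imputes them with the column mean or median, and scores the result. It assumes a common definition of NRMSE (root mean squared error on the masked entries divided by the standard deviation of the true values at those entries); the exact formula used for the figure may differ. The synthetic matrix, the 20% missing rate, and the helper names `nrmse` and `impute_columns` are all made up for this example.

```python
import numpy as np

def nrmse(truth, imputed, missing):
    """RMSE over the masked entries, scaled by the standard deviation
    of the true values at those entries (assumed definition)."""
    diff = truth[missing] - imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(truth[missing])))

def impute_columns(data, reducer):
    """Fill the NaNs of each column with reducer(column), e.g. np.nanmean."""
    filled = data.copy()
    fill_values = reducer(data, axis=0)
    rows, cols = np.where(np.isnan(data))
    filled[rows, cols] = fill_values[cols]
    return filled

rng = np.random.default_rng(2)
truth = rng.normal(loc=10.0, scale=2.0, size=(200, 20))  # hypothetical log-intensities
missing = rng.random(truth.shape) < 0.2                   # MCAR mask, 20% missing
observed = np.where(missing, np.nan, truth)

for name, reducer in [("mean", np.nanmean), ("median", np.nanmedian)]:
    imputed = impute_columns(observed, reducer)
    print(f"{name}: NRMSE = {nrmse(truth, imputed, missing):.3f}")
```

Because the synthetic columns are independent, mean and median imputation both score close to 1 here. Methods such as RF or kNN can only do better when the variables are correlated, as they typically are in real metabolomics data.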
Figure 3: Figure from https://www.nature.com/articles/s41598-017-19120-0/figures/1#search-menu
Figure 3. Evaluation of different imputation methods for MNAR. (a,b) SOR on unlabeled and labeled metabolomics data. (c,d) PCA-Procrustes sum of squared errors on unlabeled and labeled metabolomics data. (e) Pearson correlation of log p-values (t-test) of missing variables from complete data and imputed data. (f) PLS-Procrustes sum of squared errors.
What is QRILC (Quantile Regression Imputation of Left-Censored data)?
QRILC imputation was specifically designed for left-censored data, that is, values that are missing because they fall below the limit of quantification (LOQ). The method imputes missing elements by drawing randomly from a truncated distribution whose parameters are estimated by quantile regression. In the evaluation above, the data were log-transformed beforehand to improve imputation accuracy, and the R package imputeLCMD was used to run the imputation.
Since missing values in metabolomics data are usually MNAR, QRILC is generally the preferred imputation method for such data.
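The idea can be sketched in a few lines of Python. This is a simplified, QRILC-style illustration and not the imputeLCMD implementation: it handles a single log-transformed variable, assumes the underlying distribution is normal, and uses a least-squares fit on a quantile-quantile plot as a stand-in for the quantile regression step. The function name `qrilc_like_impute` and the synthetic data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def normal_quantiles(probs):
    """Standard normal quantiles for an array of probabilities in (0, 1)."""
    return np.array([_STD_NORMAL.inv_cdf(float(p)) for p in probs])

def qrilc_like_impute(values, rng):
    """Fill NaNs in one log-transformed variable, assuming they are left-censored."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    p_miss = missing.mean()
    if p_miss == 0:
        return values.copy()
    if p_miss == 1:
        raise ValueError("cannot estimate a distribution without observed values")
    observed = values[~missing]

    # Observed values are assumed to be the upper (1 - p_miss) part of a normal
    # distribution, so their quantile levels are shifted into that upper range.
    levels_obs = np.linspace(0.01, 0.99, 99)
    levels_full = p_miss + (1.0 - p_miss) * levels_obs
    sample_q = np.quantile(observed, levels_obs)
    theory_q = normal_quantiles(levels_full)

    # Straight-line fit on the Q-Q plot: slope estimates sd, intercept the mean.
    sd, mean = (float(c) for c in np.polyfit(theory_q, sample_q, deg=1))

    # Inverse-CDF sampling from the normal truncated above at the censoring point.
    u = rng.uniform(p_miss * 1e-3, p_miss, size=int(missing.sum()))
    filled = values.copy()
    filled[missing] = mean + sd * normal_quantiles(u)
    return filled

rng = np.random.default_rng(3)
truth = rng.normal(loc=10.0, scale=1.5, size=2_000)  # hypothetical log-intensities
loq = np.quantile(truth, 0.25)                       # assumed limit of quantification
censored = np.where(truth < loq, np.nan, truth)
missing = np.isnan(censored)

filled = qrilc_like_impute(censored, rng)
print(f"true mean = {truth.mean():.3f}")
print(f"mean of observed values only = {np.nanmean(censored):.3f}")
print(f"mean after imputation = {filled.mean():.3f}")
```

The design choice worth noting is that the imputed values are drawn only from the lower tail of the fitted distribution, so they land below the estimated censoring point instead of being spread around the observed mean.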
For attribution, please cite this work as
Hu (2018, Nov. 5). Metabox-Blog: How to deal with Missing Value. Retrieved from https://hushuli.github.io/Metabox-Blog.github.io/posts/2018-11-08-missing-value/
BibTeX citation
@misc{hu2018how,
author = {Hu, Shuli},
title = {Metabox-Blog: How to deal with Missing Value},
url = {https://hushuli.github.io/Metabox-Blog.github.io/posts/2018-11-08-missing-value/},
year = {2018}
}