How to deal with Missing Value

Handling missing values is the first problem to consider when a data set has gaps. This article describes the types of missing values and how to deal with each of them.

Shuli Hu (Sili Fan) http://fiehnlab.ucdavis.edu/component/contact/contact/11-members/14-wcmc/17
11-05-2018

Handling missing values is the first problem to consider whenever a data set has gaps. In general, there are three types of missing values: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). Each type calls for a different handling strategy. In the study this post is based on, the authors compared five imputation methods (random forest (RF), k-nearest neighbors (kNN), singular value decomposition (SVD), Mean, and Median) for MCAR/MAR and six imputation methods (QRILC, Half-minimum, Zero, RF, kNN, and SVD) for MNAR. (Because MCAR and MAR are similar, they were treated as one category.) The results showed that RF imputation performed best for MCAR/MAR and that QRILC was the favored method for MNAR. This article discusses the three types of missing values and how the different imputation methods were evaluated.

Next, let's look at how the three types of missing values differ.

Differences between MCAR, MAR, and MNAR

Figure 1. Illustrations of the classification for the mechanism of missing data. Blue points are observations whereas red points are missing observations in the y-variable; statistics for complete data (blue and red combined) are slope (b) = 1, standard error (se) = 0.05 and R² = 0.5. Assuming observations in the x-variable are complete, (a) represents missing at random (MAR), (b) represents missing not at random (MNAR) and (c) represents missing completely at random (MCAR). For the observed data (blue points), the estimated slope, se and R², are (a) b = 0.86, se = 0.11, R² = 0.29, (b) b = 0.432, se = 0.06, R² = 0.23 and (c) b = 0.957, se = 0.07, R² = 0.49.
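To make the three mechanisms concrete, here is a small Python sketch that simulates complete data with a true slope of 1 and then removes y-values under each mechanism. The function names, the 40% missing rate, the logistic form of the MAR probability, and the hard cutoff used for MNAR are all assumptions chosen for illustration; they are not taken from the figure's source.

```python
import math
import random


def simulate_xy(n=2000, seed=0):
    """Simulate complete data with y = x + noise, so the true slope is 1."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def mask_y(xs, ys, mechanism, rate=0.4, seed=1):
    """Return a copy of ys in which missing entries are replaced by None."""
    rng = random.Random(seed)
    cutoff = sorted(ys)[int(rate * len(ys))]  # used only for MNAR
    masked = []
    for x, y in zip(xs, ys):
        if mechanism == "MCAR":
            # Every value has the same chance of being missing.
            missing = rng.random() < rate
        elif mechanism == "MAR":
            # The chance depends only on the fully observed x.
            missing = rng.random() < 2 * rate / (1 + math.exp(-2 * x))
        elif mechanism == "MNAR":
            # Missingness depends on the value of y itself (left-censoring).
            missing = y < cutoff
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if missing else y)
    return masked


def observed_slope(xs, ys):
    """Least-squares slope of y on x, using only pairs where y is observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


xs, ys = simulate_xy()
slopes = {m: observed_slope(xs, mask_y(xs, ys, m)) for m in ("MCAR", "MAR", "MNAR")}
for mechanism, slope in slopes.items():
    print(f"{mechanism}: slope = {slope:.2f}")
```

In this setup the slope estimated under MCAR should stay close to 1, while the left-censored MNAR case should pull it well below 1, which mirrors the pattern in panels (b) and (c) of the figure.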

Evaluation of imputation

Figure 2. Evaluation of different imputation methods for MCAR/MAR. (a,b) NRMSE on unlabeled and labeled metabolomics data. (c,d) PCA-Procrustes sum of squared errors on unlabeled and labeled metabolomics data. (e) Pearson correlation of log p-values (t-test) of complete data and imputed data. (f) PLS-Procrustes sum of squared errors.
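NRMSE (normalized root mean squared error) is not defined in this post. One common definition, assumed in the sketch below, divides the mean squared error over the missing entries by the variance of the true values at those entries and takes the square root. The helper names `impute` and `nrmse` and the simulated data are hypothetical; only the Mean and Median methods are shown because they need nothing beyond the Python standard library.

```python
import math
import random
import statistics


def impute(values, method):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def nrmse(truth, masked, imputed):
    """RMSE over the missing entries, divided by the standard deviation
    of the true values at those entries (assumed definition)."""
    idx = [i for i, v in enumerate(masked) if v is None]
    mse = sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx)
    return math.sqrt(mse / statistics.pvariance([truth[i] for i in idx]))


# Simulated complete data with about 20% of values removed completely at random.
rng = random.Random(0)
truth = [rng.lognormvariate(0, 1) for _ in range(200)]
masked = [None if rng.random() < 0.2 else v for v in truth]

for method in ("mean", "median"):
    print(method, round(nrmse(truth, masked, impute(masked, method)), 3))
```

Because a mean or median fill puts the same constant into every gap, its NRMSE under this definition cannot drop below 1. That gives a useful baseline when reading the NRMSE panels: a method is only adding information if it scores below 1.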

Figure 3. Evaluation of different imputation methods for MNAR. (a,b) SOR on unlabeled and labeled metabolomics data. (c,d) PCA-Procrustes sum of squared errors on unlabeled and labeled metabolomics data. (e) Pearson correlation of log p-values (t-test) of missing variables from complete data and imputed data. (f) PLS-Procrustes sum of squared errors.
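SOR is not defined in this post. The sketch below assumes it stands for an NRMSE-based sum of ranks: for every variable with missing values, the methods are ranked by their NRMSE on that variable, and each method's ranks are then added up, so a lower total is better. The function name and the numbers are made up purely to show the arithmetic; they are not results from the study.

```python
def sum_of_ranks(nrmse_by_method):
    """Map each method to the sum of its per-variable ranks (lower is better).

    nrmse_by_method maps a method name to a list of NRMSE values,
    one per variable, in the same variable order for every method.
    """
    methods = list(nrmse_by_method)
    n_vars = len(next(iter(nrmse_by_method.values())))
    totals = {m: 0 for m in methods}
    for j in range(n_vars):
        # Rank 1 goes to the method with the smallest error on variable j.
        # Ties are broken by the order of the methods, which is a simplification.
        ordered = sorted(methods, key=lambda m: nrmse_by_method[m][j])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return totals


# Made-up NRMSE values for three placeholder methods on four variables.
example = {
    "method_a": [0.4, 0.5, 0.3, 0.6],
    "method_b": [0.6, 0.4, 0.5, 0.7],
    "method_c": [0.9, 0.8, 0.9, 0.8],
}
sor = sum_of_ranks(example)
print(sor)
```

Ranking per variable before summing keeps one variable with very large errors from dominating the comparison, which is a plausible reason to prefer it over a pooled NRMSE when only some variables are affected by MNAR.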

What is QRILC (Quantile Regression Imputation of Left-Censored data)?

QRILC was designed specifically for left-censored data, that is, values that are missing because they fall below the limit of quantification (LOQ). The method imputes each missing element with a random draw from a truncated distribution whose parameters are estimated by quantile regression. The data were log-transformed beforehand to improve imputation accuracy, and the R package imputeLCMD was used to run the imputation.
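The idea can be sketched for a single variable in plain Python. This is a simplified illustration, not a port of imputeLCMD. It assumes the log-transformed variable is roughly normal, estimates its mean and standard deviation by regressing the sorted observed values on the matching normal quantiles, and then draws the missing values from that normal truncated above at the quantile matching the missing fraction. The function name `qrilc_like_impute`, the plotting positions, and the simulated LOQ are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def qrilc_like_impute(log_values, seed=0):
    """Impute None entries of one log-transformed variable, assuming they are
    left-censored and that the complete variable is roughly normal."""
    rng = random.Random(seed)
    observed = sorted(v for v in log_values if v is not None)
    n_obs, n_all = len(observed), len(log_values)
    p_na = (n_all - n_obs) / n_all
    if p_na == 0:
        return list(log_values)
    if n_obs < 2 or observed[0] == observed[-1]:
        raise ValueError("need at least two distinct observed values")

    # Observed values are taken to be the upper (1 - p_na) part of the
    # distribution, so the i-th smallest sits near this cumulative probability.
    std_normal = NormalDist()
    z = [std_normal.inv_cdf(p_na + (1 - p_na) * (i + 0.5) / n_obs)
         for i in range(n_obs)]

    # Least-squares line through (z, observed): intercept ~ mean, slope ~ sd.
    z_bar = sum(z) / n_obs
    o_bar = sum(observed) / n_obs
    sxy = sum((a - z_bar) * (b - o_bar) for a, b in zip(z, observed))
    sxx = sum((a - z_bar) ** 2 for a in z)
    sd = sxy / sxx
    mean = o_bar - sd * z_bar
    fitted = NormalDist(mean, sd)

    # Draw from the fitted normal, truncated above at its p_na quantile,
    # by inverting the CDF on uniform draws from (0, p_na].
    return [
        fitted.inv_cdf(p_na * (1 - rng.random())) if v is None else v
        for v in log_values
    ]


# Simulated intensities with a hypothetical LOQ at the 30% quantile.
rng = random.Random(42)
raw = [math.exp(rng.gauss(10, 1)) for _ in range(500)]
loq = sorted(raw)[150]
log_censored = [math.log(v) if v >= loq else None for v in raw]
completed = qrilc_like_impute(log_censored)
```

The imputed values come back on the log scale and always fall in the lower tail of the fitted distribution, which is the behavior expected for values that were missing because they were too small to quantify.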

Because missing values in metabolomics data are usually MNAR, QRILC is generally the best choice of imputation method for such data.

Citation

For attribution, please cite this work as

Hu (2018, Nov. 5). Metabox-Blog: How to deal with Missing Value. Retrieved from https://hushuli.github.io/Metabox-Blog.github.io/posts/2018-11-08-missing-value/

BibTeX citation

@misc{hu2018how,
  author = {Hu, Shuli},
  title = {Metabox-Blog: How to deal with Missing Value},
  url = {https://hushuli.github.io/Metabox-Blog.github.io/posts/2018-11-08-missing-value/},
  year = {2018}
}