Let’s load the Social Indicators Survey dataset for this tutorial (cf. dataset’s codebook):
sis <- read.csv("siswave3v4impute3.csv", sep=";")
# from Gelman & Hill's book: recode negative values and 999999 as NA
na.fix <- function (a) {
  ifelse(a < 0 | a == 999999, NA, a)
}
sis$earnings <- (na.fix(sis$rearn) + na.fix(sis$tearn))/1000
# select a column subset of the entire dataset
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, earnings, police)))
sis_sm[91:95,]
## sex race educ_r r_age earnings police
## 91 1 3 3 31 NA 0
## 92 2 1 2 37 135 1
## 93 0 1 1 3 NA 2
## 94 1 1 3 42 3 1
## 95 1 3 1 24 0 NA
# clean dataset
sis_sm <- sis_sm[c(-93,-757),] # remove rows 93 & 757 (typos?)
# give proper types to predictors
sis_sm$sex <- as.factor(sis_sm$sex)
sis_sm$race <- as.factor(sis_sm$race)
sis_sm$educ_r <- as.ordered(sis_sm$educ_r) # ordered factor
sis_sm$police <- as.factor(sis_sm$police)
sapply(sis_sm, class)
## $sex
## [1] "factor"
##
## $race
## [1] "factor"
##
## $educ_r
## [1] "ordered" "factor"
##
## $r_age
## [1] "numeric"
##
## $earnings
## [1] "numeric"
##
## $police
## [1] "factor"
summary(sis_sm)
## sex race educ_r r_age earnings police
## 1 :543 1 :478 1 :229 Min. :18.00 Min. : 0.00 0 :490
## 2 :956 2 :401 2 :398 1st Qu.:30.00 1st Qu.: 5.00 1 :941
## NA's: 2 3 :453 3 :376 Median :39.00 Median : 28.00 NA's: 70
## 4 :130 4 :483 Mean :41.33 Mean : 52.56
## NA's: 39 NA's: 15 3rd Qu.:50.00 3rd Qu.: 65.00
## Max. :97.00 Max. :3250.00
## NA's :2 NA's :252
Notice the existence of several NA values, i.e., missing data.
There are several types of missing data (this is Gelman & Hill's classification):
MCAR - Missing Completely at Random. The probability that a value is NA is the same for all rows and columns. This is the best-case scenario, because if we remove those samples from the inference, the bias is not affected. However, this is also the least probable case for our dataset. For instance, our dataset does not seem to be MCAR:
library(dplyr)
library(magrittr)
sis_dp <- tbl_df(sis_sm)
check.income.NA <- function(for.race) {
  df <- sis_dp %>% dplyr::filter(race == for.race)
  n <- df %>% nrow()
  n.NA <- df %>% dplyr::filter(is.na(earnings)) %>% nrow()
  list(total = n, total.NA = n.NA, perc = n.NA/n)
}
check.income.NA(1) # whites
## $total
## [1] 478
##
## $total.NA
## [1] 92
##
## $perc
## [1] 0.1924686
check.income.NA(2) # blacks
## $total
## [1] 401
##
## $total.NA
## [1] 47
##
## $perc
## [1] 0.117207
The missing-value percentage for whites (19.2%) is quite different from that for blacks (11.7%). This is evidence that this dataset is not MCAR.
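To make this check slightly more formal (a sketch not in the original analysis), we could test the independence of race and the missingness of earnings with a chi-squared test:
# test whether missingness in earnings is independent of race
# (a small p-value is further evidence against MCAR)
chisq.test(table(sis_sm$race, is.na(sis_sm$earnings)))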
MAR - Missing at Random. Here the hypothesis is that the probability that a certain variable is NA depends only on the available information. So, if income is missing, its missing probability is modelled by the other available data like sex, race, etc.
Missing that depends on unobserved predictors. In this case, there is a lack of information in the dataset. E.g., say education is important to reveal income, but education information was not collected. Since it is likely that the dataset is not representative of the population with respect to education (there was no control for it), there will be extra bias in our inferences.
Censoring. This is when the value of the variable itself influences the probability of missingness. Perhaps people with large incomes usually prefer not to state them in the survey. This can be mitigated by adding more data: in this example, people with high education or highly paid jobs tend to have higher incomes, so we could model income with this extra information.
A good methodology is to include as many predictors as possible so that the MAR hypothesis is reasonable for our model.
mice
We’ll use the mice package to deal with missing data.
library(mice)
Function md.pattern presents a report about how the missing data is spread over the dataset:
md.pattern(sis_sm)
## sex r_age educ_r race police earnings
## 1169 1 1 1 1 1 1 0
## 21 1 1 1 0 1 1 1
## 6 1 1 0 1 1 1 1
## 219 1 1 1 1 1 0 1
## 50 1 1 1 1 0 1 1
## 12 1 1 1 0 1 0 2
## 3 1 1 0 1 1 0 2
## 1 1 1 1 0 0 1 2
## 2 1 1 0 1 0 1 2
## 12 1 1 1 1 0 0 2
## 1 1 1 0 0 1 0 3
## 2 1 1 1 0 0 0 3
## 1 1 1 0 1 0 0 3
## 2 0 0 0 0 0 0 6
## 2 2 15 39 70 252 380
We see that 1169 rows are complete (have zero NAs), 21 miss only the race info, and so on…
In this dataset, we could simply remove the samples with the most missing values. For example, the last six samples (those with three or more NAs) could be removed without a significant loss of information. There are 30 rows with two missing values; it would be a decision for the data analyst to keep or remove them.
We can also visualize this information:
library(VIM)
aggr(sis_sm, col=c('blue','red'), numbers=TRUE, sortVars=TRUE,
     labels=names(sis_sm), cex.axis=.8, gap=3,
     ylab=c("Histogram of missing data", "Red is Missing Data"))
##
## Variables sorted by number of missings:
## Variable Count
## earnings 0.167888075
## police 0.046635576
## race 0.025982678
## educ_r 0.009993338
## sex 0.001332445
## r_age 0.001332445
Another function is md.pairs, which computes, for every pair of variables, the number of observations in each observed/missing combination ($rr = both observed, $rm = row variable observed and column variable missing, $mr = the reverse, $mm = both missing):
md.pairs(sis_sm)
## $rr
## sex race educ_r r_age earnings police
## sex 1499 1462 1486 1499 1249 1431
## race 1462 1462 1450 1462 1227 1397
## educ_r 1486 1450 1486 1486 1241 1421
## r_age 1499 1462 1486 1499 1249 1431
## earnings 1249 1227 1241 1249 1249 1196
## police 1431 1397 1421 1431 1196 1431
##
## $rm
## sex race educ_r r_age earnings police
## sex 0 37 13 0 250 68
## race 0 0 12 0 235 65
## educ_r 0 36 0 0 245 65
## r_age 0 37 13 0 250 68
## earnings 0 22 8 0 0 53
## police 0 34 10 0 235 0
##
## $mr
## sex race educ_r r_age earnings police
## sex 0 0 0 0 0 0
## race 37 0 36 37 22 34
## educ_r 13 12 0 13 8 10
## r_age 0 0 0 0 0 0
## earnings 250 235 245 250 0 235
## police 68 65 65 68 53 0
##
## $mm
## sex race educ_r r_age earnings police
## sex 2 2 2 2 2 2
## race 2 39 3 2 17 5
## educ_r 2 3 15 2 7 5
## r_age 2 2 2 2 2 2
## earnings 2 17 7 2 252 17
## police 2 5 5 2 17 70
For example, there are 250 samples where the sex value is present but the earnings are not (cf. matrix $rm).
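We can confirm this count directly:
# rows where sex is observed but earnings is missing
# (should give 250, matching the $rm matrix above)
sum(!is.na(sis_sm$sex) & is.na(sis_sm$earnings))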
The simplest method is to remove the samples with missing data:
sis_sm1a <- cc(sis_sm) # mice::cc() returns the complete cases
head(sis_sm1a)
## sex race educ_r r_age earnings police
## 1 1 1 4 29 84.0 0
## 3 1 3 2 36 27.5 0
## 4 1 1 4 71 85.0 1
## 5 1 4 4 30 135.0 1
## 7 2 1 4 32 92.0 1
## 8 2 2 2 67 0.0 1
This, however, can remove too many samples if many of them have columns with NA values.
Also, if the removed samples differ in any systematic way from the complete samples, then the analysis over the complete cases will be biased.
A partial solution is to remove only a minority of samples, those that have too much missing information. In our example, say we decide to remove all the cases with three or more NA values:
n_NAs <- apply(sis_sm, 1, function(x) sum(is.na(x))) # number of NAs per row
sis_sm2 <- sis_sm[n_NAs<3,] # keep those with 0,1,2 NAs
md.pattern(sis_sm2)
## sex r_age educ_r race police earnings
## 1169 1 1 1 1 1 1 0
## 21 1 1 1 0 1 1 1
## 6 1 1 0 1 1 1 1
## 219 1 1 1 1 1 0 1
## 50 1 1 1 1 0 1 1
## 12 1 1 1 0 1 0 2
## 3 1 1 0 1 1 0 2
## 1 1 1 1 0 0 1 2
## 2 1 1 0 1 0 1 2
## 12 1 1 1 1 0 0 2
## 0 0 11 34 65 246 356
If we only partially remove missing data, we still have to deal with the remaining NA values.
Another method is to use different subsets of the original dataset to answer different inference questions. In this case, we would discard fewer samples, since each inference uses fewer columns. The drawback is that the various inferences might not be consistent with each other, since each one used different data.
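For instance, a quick sketch of this available-case approach (the column pairs are chosen arbitrarily for illustration):
# each question keeps only the rows complete in the columns it needs
nrow(na.omit(sis_sm[, c("earnings", "educ_r")])) # cases usable for earnings ~ educ_r
nrow(na.omit(sis_sm[, c("police", "race")]))     # cases usable for police ~ race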
Data imputation is the replacement of NA values with estimated values produced by an imputation algorithm.
Categorical variables are tricky. For example, we cannot replace a missing value with the mean, since that would probably produce a value outside the variable's domain. A better alternative is to replace it with the mode, but even that is not advisable.
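For illustration only (a sketch, not a recommendation), mode imputation for a factor could look like this:
# mode imputation for the 'race' factor -- simple, but it inflates the
# majority class, which is one reason it is not advisable
race.mode <- names(which.max(table(sis_sm2$race)))
race.imp <- sis_sm2$race
race.imp[is.na(race.imp)] <- race.mode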
A better method is to fit a linear regression (or a logistic regression, for binary variables) on the available data and then, for each missing value, use the respective prediction as the imputed value. A variant adds some reasonable random noise (also estimated from the available data) to the imputation.
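Here is a minimal sketch of that idea for earnings (the predictors are chosen arbitrarily for illustration):
# fit on the observed rows, then predict the missing ones and add noise
fit <- lm(earnings ~ sex + race + r_age, data=sis_sm2)
to.fill <- is.na(sis_sm2$earnings) &
           complete.cases(sis_sm2[, c("sex", "race", "r_age")])
noise <- rnorm(sum(to.fill), 0, summary(fit)$sigma) # estimated residual noise
sis.reg <- sis_sm2
sis.reg$earnings[to.fill] <- predict(fit, newdata=sis_sm2[to.fill,]) + noise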
Function mice executes a given imputation method over the dataset. Several methods are available:
methods(mice)
## [1] mice.impute.2l.norm mice.impute.2l.pan mice.impute.2lonly.mean mice.impute.2lonly.norm mice.impute.2lonly.pmm mice.impute.cart
## [7] mice.impute.fastpmm mice.impute.lda mice.impute.logreg mice.impute.logreg.boot mice.impute.mean mice.impute.norm
## [13] mice.impute.norm.boot mice.impute.norm.nob mice.impute.norm.predict mice.impute.passive mice.impute.pmm mice.impute.polr
## [19] mice.impute.polyreg mice.impute.quadratic mice.impute.rf mice.impute.ri mice.impute.sample mice.mids
## [25] mice.theme
## see '?methods' for accessing help and source code
Here’s information about some of these methods:
- pmm: predictive mean matching (for numeric data)
- mean: unconditional mean imputation (for numeric data)
- norm: Bayesian linear regression (for numeric data)
- norm.nob: (classical) linear regression (for numeric data)
- logreg: logistic regression (for factors with 2 levels)
- polyreg: multinomial logit (for factors with 2+ levels)
- polr: ordered logit (for ordered factors with 2+ levels)
We can use a specific method for each column (if a predictor is complete, just place "" in the meth argument list):
sapply(sis_sm2, class) # check again the types of each predictor
## $sex
## [1] "factor"
##
## $race
## [1] "factor"
##
## $educ_r
## [1] "ordered" "factor"
##
## $r_age
## [1] "numeric"
##
## $earnings
## [1] "numeric"
##
## $police
## [1] "factor"
ni <- 7 # the number of imputations
imp.data <- mice(sis_sm2, m=ni, maxit=50,
                 meth=c('logreg', 'polyreg', 'polr', 'pmm', 'pmm', 'logreg'),
                 seed=99, diag=FALSE, print=FALSE)
summary(imp.data)
## Multiply imputed data set
## Call:
## mice(data = sis_sm2, m = ni, method = c("logreg", "polyreg",
## "polr", "pmm", "pmm", "logreg"), maxit = 50, diagnostics = FALSE,
## printFlag = FALSE, seed = 99)
## Number of multiple imputations: 7
## Missing cells per column:
## sex race educ_r r_age earnings police
## 0 34 11 0 246 65
## Imputation methods:
## sex race educ_r r_age earnings police
## "logreg" "polyreg" "polr" "pmm" "pmm" "logreg"
## VisitSequence:
## race educ_r earnings police
## 2 3 5 6
## PredictorMatrix:
## sex race educ_r r_age earnings police
## sex 0 0 0 0 0 0
## race 1 0 1 1 1 1
## educ_r 1 1 0 1 1 1
## r_age 0 0 0 0 0 0
## earnings 1 1 1 1 0 1
## police 1 1 1 1 1 0
## Random generator seed value: 99
mice gives us information about which set of predictors was used to impute each variable. This information is kept in the predictor matrix:
[The predictor matrix] rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of ‘1’ means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero [ref: mice help]
This predictor matrix can be supplied by the user in the pred argument, which allows her to control which predictors are used for each variable.
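For example, a minimal sketch (reusing the method vector from above): start from the matrix mice built and drop, say, police as a predictor of earnings:
pred <- imp.data$predictorMatrix # the matrix shown in the summary above
pred["earnings", "police"] <- 0  # don't use police when imputing earnings
imp.data.b <- mice(sis_sm2, m=ni, maxit=50,
                   meth=c('logreg', 'polyreg', 'polr', 'pmm', 'pmm', 'logreg'),
                   pred=pred, seed=99, print=FALSE)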
For diagnostic checking, i.e., to see if the imputations are plausible, we can inspect the values assigned to each missing cell at each imputation:
head(imp.data$imp$police, 10) # for this predictor, should be 0s and 1s only
## 1 2 3 4 5 6 7
## 31 0 1 1 1 1 1 0
## 38 1 1 0 1 0 0 0
## 52 1 1 1 1 1 1 1
## 54 1 1 1 1 0 0 1
## 58 1 1 1 1 1 1 1
## 95 0 0 1 0 0 1 1
## 115 1 1 1 1 1 1 1
## 136 0 0 0 1 0 0 0
## 159 0 1 1 0 1 1 1
## 167 1 1 1 1 1 1 1
Package mice provides several plotting tools. Function stripplot shows the distributions for each predictor:
library(lattice)
stripplot(imp.data, pch = 20, cex = 1)
We can check whether the density plots of the original data and the imputed data are close:
densityplot(imp.data, xlim=c(-1e2,1e3)) # blue is the original data, red are the imputations
In this case, earnings is the only continuous variable with substantial missing data. Although the plot shows density below zero, there are no negative earnings; this is only an artifact of fitting a density to the actual data points.
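We can also compare the observed and imputed values numerically (imp.data$imp$earnings holds one column per imputation, as we saw above for police):
summary(sis_sm2$earnings)              # observed values
summary(unlist(imp.data$imp$earnings)) # all imputed values pooled together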
And this is a scatterplot of earnings against other predictors:
xyplot(imp.data, earnings ~ race+educ_r+police+r_age, pch=20, cex=1)
One way to check whether the imputation algorithm ran OK is to verify how the mean and variance of each imputation evolved over the iterations (let's call them imputation streams). A sign of convergence is seeing the streams mix with each other, with no sign of a trend. If the streams look flat, something went wrong and the user should try other methods, increase the number of iterations (fortunately, the number of iterations needed is much lower than for MCMC methods), or make changes to the predictor matrix.
plot(imp.data, c("race", "earnings", "police", "educ_r"))
Function complete uses the imputation information to create a complete dataset:
head(sis_sm2)
## sex race educ_r r_age earnings police
## 1 1 1 4 29.00000 84.0 0
## 2 2 3 4 40.00000 NA 1
## 3 1 3 2 36.00000 27.5 0
## 4 1 1 4 71.00000 85.0 1
## 5 1 4 4 30.00000 135.0 1
## 6 2 2 4 41.32423 NA 0
sis_sm2c <- complete(imp.data, action=1)
head(sis_sm2c)
## sex race educ_r r_age earnings police
## 1 1 1 4 29.00000 84.0 0
## 2 2 3 4 40.00000 32.5 1
## 3 1 3 2 36.00000 27.5 0
## 4 1 1 4 71.00000 85.0 1
## 5 1 4 4 30.00000 135.0 1
## 6 2 2 4 41.32423 26.0 0
In function complete, the argument action='long' creates a larger dataset stacking all the imputations' values (check the help page), which can then be split into the different imputed datasets:
n <- nrow(sis_sm2) # this dataset has 1495 rows
sis_sm2cl <- complete(imp.data, action='long')
nrow(sis_sm2cl) # 1495 rows * 7 imputations
## [1] 10465
# split into one data frame per imputation (dropping the .imp and .id columns)
dfs <- split(sis_sm2cl[,c(-1,-2)], f=sis_sm2cl$.imp)
head(dfs[[1]]) # first imputation
## sex race educ_r r_age earnings police
## 1 1 1 4 29.00000 84.0 0
## 2 2 3 4 40.00000 32.5 1
## 3 1 3 2 36.00000 27.5 0
## 4 1 1 4 71.00000 85.0 1
## 5 1 4 4 30.00000 135.0 1
## 6 2 2 4 41.32423 26.0 0
head(dfs[[ni]]) # last imputation
## sex race educ_r r_age earnings police
## 8971 1 1 4 29.00000 84.0 0
## 8972 2 3 4 40.00000 70.0 1
## 8973 1 3 2 36.00000 27.5 0
## 8974 1 1 4 71.00000 85.0 1
## 8975 1 4 4 30.00000 135.0 1
## 8976 2 2 4 41.32423 0.0 0
We can run the same inference over all the imputed datasets and then pool the results. Let's assume we would like to check whether earnings is influenced by the other predictors:
# performing ni linear regressions
fits <- with(imp.data, lm(earnings ~ sex+race+police+educ_r))
summary(fits$analyses[[ni]]) # just from one regression
##
## Call:
## lm(formula = earnings ~ sex + race + police + educ_r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.50 -34.66 -11.27 12.83 3150.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.755 11.662 1.608 0.1080
## sex2 -2.911 6.135 -0.474 0.6352
## race2 3.006 8.036 0.374 0.7084
## race3 -14.345 8.067 -1.778 0.0755 .
## race4 -16.101 11.078 -1.453 0.1463
## police2 13.316 6.341 2.100 0.0359 *
## educ_r2 8.291 9.518 0.871 0.3838
## educ_r3 23.016 9.641 2.387 0.0171 *
## educ_r4 67.425 9.743 6.920 6.67e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 112.6 on 1486 degrees of freedom
## Multiple R-squared: 0.07068, Adjusted R-squared: 0.06568
## F-statistic: 14.13 on 8 and 1486 DF, p-value: < 2.2e-16
summary(pool(fits)) # pooled results
## est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
## (Intercept) 20.889623 13.115835 1.5927025 134.44978 1.135753e-01 -5.0504219 46.8296674 NA 0.21060091 0.19894494
## sex2 -1.214141 6.261670 -0.1939005 1214.90625 8.462862e-01 -13.4990274 11.0707457 NA 0.02951440 0.02791808
## race2 -2.998659 10.141810 -0.2956729 43.55632 7.688849e-01 -23.4440166 17.4466990 NA 0.38989176 0.36250583
## race3 -19.261405 9.554186 -2.0160174 72.37198 4.751091e-02 -38.3056540 -0.2171554 NA 0.29719480 0.27803748
## race4 -19.967243 13.266031 -1.5051407 66.91614 1.369938e-01 -46.4469630 6.5124762 NA 0.31009511 0.28977873
## police2 13.189392 6.472954 2.0376157 1240.65052 4.180019e-02 0.4902471 25.8885374 NA 0.02772790 0.02616180
## educ_r2 9.826511 9.942971 0.9882871 656.63073 3.233761e-01 -9.6973422 29.3503636 NA 0.07205852 0.06923645
## educ_r3 24.219873 9.817695 2.4669613 1351.48474 1.374976e-02 4.9602971 43.4794498 NA 0.01940862 0.01795856
## educ_r4 69.312043 10.037361 6.9054053 1042.69290 8.687717e-12 49.6163150 89.0077701 NA 0.04115345 0.03931603
The p-values (the Pr(>|t|) column) indicate that the two higher education levels are strongly significant for predicting income (the higher the education, the higher the income). Having police in the neighborhood (police=1) is weakly significant. Hispanic subjects (race=3) also show a weakly significant negative coefficient.
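To see what pool is doing internally, here is a hand-computed sketch of Rubin's rules for a single coefficient: the pooled estimate is the mean of the per-imputation estimates, and its variance combines the within- and between-imputation variances:
ests <- sapply(fits$analyses, function(f) coef(f)["educ_r4"])
ses  <- sapply(fits$analyses, function(f) coef(summary(f))["educ_r4", "Std. Error"])
W <- mean(ses^2)                 # within-imputation variance
B <- var(ests)                   # between-imputation variance
total.var <- W + (1 + 1/ni) * B  # total variance (Rubin's rules)
c(est = mean(ests), se = sqrt(total.var))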
simputation
simputation is a new (September 2016) package that standardizes the interface to different imputation methods.
library(magrittr)
library(simputation)
In the author’s vignette, he uses the iris dataset with some NAs:
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA # make some NAs
head(dat,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 NA 3.5 1.4 0.2 setosa
## 2 NA 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.6 NA 1.5 0.2 setosa
## 5 5.0 NA 1.4 0.2 setosa
## 6 5.4 NA 1.7 0.4 setosa
## 7 4.6 NA 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 <NA>
## 9 4.4 2.9 1.4 0.2 <NA>
## 10 4.9 3.1 1.5 0.1 <NA>
The next code imputes Sepal.Width and Sepal.Length by regression using the values of Petal.Width and Species:
dat %>%
impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) -> imputed
head(imputed,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.979844 3.500000 1.4 0.2 setosa
## 2 4.979844 3.000000 1.4 0.2 setosa
## 3 4.979844 3.409547 1.3 0.2 setosa
## 4 4.600000 3.409547 1.5 0.2 setosa
## 5 5.000000 3.409547 1.4 0.2 setosa
## 6 5.400000 3.561835 1.7 0.4 setosa
## 7 4.600000 3.485691 1.4 0.3 setosa
## 8 5.000000 3.400000 1.5 0.2 <NA>
## 9 4.400000 2.900000 1.4 0.2 <NA>
## 10 4.900000 3.100000 1.5 0.1 <NA>
Next, let’s impute the missing Species values using a decision tree model:
imputed %>%
impute_cart(Species ~ .) -> imputed
head(imputed,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.979844 3.500000 1.4 0.2 setosa
## 2 4.979844 3.000000 1.4 0.2 setosa
## 3 4.979844 3.409547 1.3 0.2 setosa
## 4 4.600000 3.409547 1.5 0.2 setosa
## 5 5.000000 3.409547 1.4 0.2 setosa
## 6 5.400000 3.561835 1.7 0.4 setosa
## 7 4.600000 3.485691 1.4 0.3 setosa
## 8 5.000000 3.400000 1.5 0.2 setosa
## 9 4.400000 2.900000 1.4 0.2 setosa
## 10 4.900000 3.100000 1.5 0.1 setosa
The package can impute for each class separately:
# complete 'Species' using the sequential hot desk (shd) method
dat %>% impute_shd(Species ~ Petal.Length) -> dat2
# then impute Sepal.Length by regressing on Sepal.Width, for each Species
dat2 %>% impute_lm(Sepal.Length ~ Sepal.Width | Species) %>% head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.067813 3.5 1.4 0.2 setosa
## 2 4.725677 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.600000 NA 1.5 0.2 setosa
## 5 5.000000 NA 1.4 0.2 setosa
## 6 5.400000 NA 1.7 0.4 setosa
Notice that the value in the 3rd row could not be imputed, since its predictor Sepal.Width was also missing.
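One way around this (a sketch) is to chain the imputations, filling Sepal.Width first so that row 3 can then be imputed as well:
dat2 %>%
  impute_lm(Sepal.Width ~ Petal.Width + Petal.Length) %>%
  impute_lm(Sepal.Length ~ Sepal.Width | Species) %>%
  head()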
This by-group imputation can also be executed using dplyr::group_by:
dat2 %>% dplyr::group_by(Species) %>%
impute_lm(Sepal.Length ~ Sepal.Width) %>%
head()
## Source: local data frame [6 x 5]
## Groups: Species [1]
##
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.067813 3.5 1.4 0.2 setosa
## 2 4.725677 3.0 1.4 0.2 setosa
## 3 NA NA 1.3 0.2 setosa
## 4 4.600000 NA 1.5 0.2 setosa
## 5 5.000000 NA 1.4 0.2 setosa
## 6 5.400000 NA 1.7 0.4 setosa