I am working with Zillow ZTRAX data and am currently trying to use the MICE package for imputation.
Unfortunately, I am running into issues, and since this is my first attempt at using MICE and at imputing ZTRAX data, I am having a difficult time troubleshooting.
First, here is a look at the structure of the data, which contains 4,392,023 observations with 13 variables:
head(imputation_data)
# A tibble: 6 x 13
sale_date sale_price prop_latitude prop_longitude lot_sqft property_land_use year_built total_bedrooms total_baths airconditioning_type prop_fireplace prop_sqft building_age
<date> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 2015-01-01 798500 NA NA NA NA NA NA NA n n NA NA
2 2015-01-02 NA 42.7 -73.8 62726. GV107 1967 NA NA n n 712 48
3 2015-01-02 NA NA NA NA NA NA NA NA n n NA NA
4 2015-01-01 NA 42.8 -73.9 14810. RR101 1950 3 1 n Y 1370 65
5 2015-01-05 NA 42.7 -73.8 5227. RR101 1926 4 1.5 n Y 1770 89
6 2015-01-01 NA NA NA NA NA NA NA NA n n NA NA
Here is also a quick look at the number of NA values for each variable:
A tibble: 1 x 12
sale_date.na sale_price.na prop_lat.na prop_log.na prop_sqft.na land_use.na year_built.na bedrooms.na baths.na air.na fire.na age.na
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 2836767 730297 730297 1046670 440787 1038065 1576547 1195667 2471 0 1038065
To start, I was attempting to use MICE to impute missing property_land_use values, where, for example, RR101 indicates a single-family residence and GV107, in the above example, refers to a "governmental emergency building" (which, according to the ZTRAX data dictionary, is likely a police station/fire house). Side note: I just realized that, given the scope of this research, I can likely filter down to just RR-type land uses. Anyway...
I created my own formula in MICE:
### creating LM for property land use
form1 <- list(property_land_use ~ sale_date + sale_price + prop_latitude + prop_longitude + lot_sqft)
form1 <- name.formulas(form1)
### running the model
imp1 <- mice(imputation_data, formulas = form1, print = TRUE, m = 1, seed = 12199)
Given that similar properties are likely grouped together, I believe latitude and longitude are good variables, as well as lot_sqft.
The imputation runs just fine with MICE, as indicated by the output of print = TRUE:
iter imp variable
1 1 property_land_use
1 2 property_land_use
1 3 property_land_use
1 4 property_land_use
1 5 property_land_use
2 1 property_land_use
2 2 property_land_use
2 3 property_land_use
2 4 property_land_use
2 5 property_land_use
Unfortunately, it does not seem that any imputation took place:
imp1 <- complete(imp1)
imp1 %>%
summarize(land_use.na = sum(is.na(property_land_use)))
land_use.na
1 440787
As you can see, the number of NA values for property_land_use is the same as in the pre-imputation data.
Any help/advice/guidance would be greatly appreciated.
I assume I am missing something small within the MICE workflow that is causing this, but I am not familiar enough with the package to know exactly what it is.
CodePudding user response:
I think the problem is that, since you're not imputing the variables that predict property_land_use, the imputed values will also be missing whenever those predictors are missing. Here's a small example:
dat <- data.frame(y = c(NA,2,3, NA, 4,5,6),
x = c(1,NA, 2, 3, 4,6,5),
z = c(1,3,2,NA,6,5,4))
dat
#> y x z
#> 1 NA 1 1
#> 2 2 NA 3
#> 3 3 2 2
#> 4 NA 3 NA
#> 5 4 4 6
#> 6 5 6 5
#> 7 6 5 4
library(mice)
form1 <- list(y ~ x + z)
form1 <- name.formulas(form1)
imp1 <- mice(dat, formulas = form1, print = TRUE, m = 1, seed = 12199)
#>
#> iter imp variable
#> 1 1 y
#> 2 1 y
#> 3 1 y
#> 4 1 y
#> 5 1 y
complete(imp1)
#> y x z
#> 1 3 1 1
#> 2 2 NA 3
#> 3 3 2 2
#> 4 NA 3 NA
#> 5 4 4 6
#> 6 5 6 5
#> 7 6 5 4
Note, in the example above, that y has two missing values: the first and fourth observations. x and z are both observed for the first observation, but z is missing for the fourth. When I impute using the formula and look at the completed dataset, I see that the first observation has an imputed value but the fourth does not. If I use all the information to impute all the variables, you can see that we get a fully completed dataset at the end:
imp2 <- mice(dat, print = TRUE, m = 1, seed = 12199)
#>
#> iter imp variable
#> 1 1 y x z
#> 2 1 y x z
#> 3 1 y x z
#> 4 1 y x z
#> 5 1 y x z
complete(imp2)
#> y x z
#> 1 3 1 1
#> 2 2 1 3
#> 3 3 2 2
#> 4 3 3 3
#> 5 4 4 6
#> 6 5 6 5
#> 7 6 5 4
Created on 2022-11-20 by the reprex package (v2.0.1)
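To check whether this explanation fits a given dataset, one option is to count the rows where the target is missing and at least one of its predictors is also missing. The following is a minimal base R sketch on the toy data from the example; the names target, predictors, and blocked are made up for illustration. For the ZTRAX data, the target would presumably be property_land_use and the predictors the variables on the right-hand side of the formula.

```r
# Toy data from the example above
dat <- data.frame(y = c(NA, 2, 3, NA, 4, 5, 6),
                  x = c(1, NA, 2, 3, 4, 6, 5),
                  z = c(1, 3, 2, NA, 6, 5, 4))

# Hypothetical names, chosen for this illustration only
target <- "y"
predictors <- c("x", "z")

# TRUE for rows where the target is missing
target_missing <- is.na(dat[[target]])

# TRUE for rows where every predictor is observed
predictors_complete <- complete.cases(dat[, predictors, drop = FALSE])

# Rows with a missing target and at least one missing predictor
blocked <- target_missing & !predictors_complete

n_blocked <- sum(blocked)
print(which(blocked))
print(n_blocked)
```

Here the only flagged row is the fourth observation, which is the one that stayed NA after imputing y from x and z alone. If the same count on the real data comes out close to the number of remaining NA values, that would be consistent with the explanation above, though it does not rule out other causes.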