Using MICE for imputation on Zillow ZTRAX data - but no imputation occurs

Time:11-21

I am working with Zillow ZTRAX data and am trying to use the MICE package for imputation. Unfortunately, I am running into issues, and since this is my first attempt at using MICE and at imputing ZTRAX data, I am having a hard time troubleshooting.

First, here is a look at the structure of the data, which contains 4,392,023 observations with 13 variables:

head(imputation_data)
# A tibble: 6 x 13
  sale_date  sale_price prop_latitude prop_longitude lot_sqft property_land_use year_built total_bedrooms total_baths airconditioning_type prop_fireplace prop_sqft building_age
  <date>          <dbl>         <dbl>          <dbl>    <dbl> <chr>                  <dbl>          <dbl>       <dbl> <chr>                <chr>              <dbl>        <dbl>
1 2015-01-01     798500          NA             NA        NA  NA                        NA             NA        NA   n                    n                     NA           NA
2 2015-01-02         NA          42.7          -73.8   62726. GV107                   1967             NA        NA   n                    n                    712           48
3 2015-01-02         NA          NA             NA        NA  NA                        NA             NA        NA   n                    n                     NA           NA
4 2015-01-01         NA          42.8          -73.9   14810. RR101                   1950              3         1   n                    Y                   1370           65
5 2015-01-05         NA          42.7          -73.8    5227. RR101                   1926              4         1.5 n                    Y                   1770           89
6 2015-01-01         NA          NA             NA        NA  NA                        NA             NA        NA   n                    n                     NA           NA

Here is also a quick look at the number of NA values for each variable:

A tibble: 1 x 12
  sale_date.na sale_price.na prop_lat.na prop_log.na prop_sqft.na land_use.na year_built.na bedrooms.na baths.na air.na fire.na  age.na
         <int>         <int>       <int>       <int>        <int>       <int>         <int>       <int>    <int>  <int>   <int>   <int>
1            0       2836767      730297      730297      1046670      440787       1038065     1576547  1195667   2471       0 1038065

To start, I was attempting to use MICE to impute missing property_land_use values, wherein - for example - RR101 indicates a single-family residence and GV107, in the above example, refers to a "governmental emergency building" (which, according to the ZTRAX data dictionary, is likely a police station/fire house). Side note: I just realized, given the scope of this research, that I can likely filter down to just RR type land uses. Anyways ...

I created my own formula in MICE:

### creating LM for property land use
form1 <- list(property_land_use ~ sale_date + sale_price + prop_latitude + prop_longitude + lot_sqft)
form1 <- name.formulas(form1)

### running the model
imp1 <- mice(imputation_data, formulas = form1, print = TRUE, m = 1, seed = 12199)

Given that similar properties are likely grouped together, I believe latitude and longitude are good variables, as well as lot_sqft.

The imputation runs just fine with MICE as indicated by the output of print = TRUE:

iter imp variable
  1   1  property_land_use
  1   2  property_land_use
  1   3  property_land_use
  1   4  property_land_use
  1   5  property_land_use
  2   1  property_land_use
  2   2  property_land_use
  2   3  property_land_use
  2   4  property_land_use
  2   5  property_land_use

Unfortunately, it does not seem that any imputation took place:

imp1 <- complete(imp1)
imp1 %>%
  summarize(land_use.na = sum(is.na(property_land_use)))

land_use.na
1      440787

As you can see, the number of NA values for property_land_use is the same as in the pre-imputation data.

Any help/advice/guidance would be greatly appreciated.

I assume I am missing some small step within the MICE workflow that is causing this, but I am not familiar enough with the package to know exactly what it is.

CodePudding user response:

I think the problem is that since you're not imputing the variables that predict property_land_use, when those other variables are missing, the imputed values will also be missing. Here's a small example:

dat <- data.frame(y = c(NA,2,3, NA, 4,5,6), 
                  x = c(1,NA, 2, 3, 4,6,5), 
                  z = c(1,3,2,NA,6,5,4))

dat
#>    y  x  z
#> 1 NA  1  1
#> 2  2 NA  3
#> 3  3  2  2
#> 4 NA  3 NA
#> 5  4  4  6
#> 6  5  6  5
#> 7  6  5  4
library(mice)
form1 <- list(y ~ x + z)
form1 <- name.formulas(form1)

imp1 <- mice(dat, formulas = form1, print = TRUE, m = 1, seed = 12199)
#> 
#>  iter imp variable
#>   1   1  y
#>   2   1  y
#>   3   1  y
#>   4   1  y
#>   5   1  y
complete(imp1)
#>    y  x  z
#> 1  3  1  1
#> 2  2 NA  3
#> 3  3  2  2
#> 4 NA  3 NA
#> 5  4  4  6
#> 6  5  6  5
#> 7  6  5  4

Note, in the example above, that y has two missing values: the first and fourth observations. Both x and z are observed for the first observation, but z is missing for the fourth. When I impute using the formula and look at the completed dataset, the first observation has an imputed value but the fourth does not, because one of its predictors is itself missing and was never imputed. If I instead let mice use all the information to impute all the variables, we get a fully completed dataset at the end:

imp2 <- mice(dat, print = TRUE, m = 1, seed = 12199)
#> 
#>  iter imp variable
#>   1   1  y  x  z
#>   2   1  y  x  z
#>   3   1  y  x  z
#>   4   1  y  x  z
#>   5   1  y  x  z
complete(imp2)
#>   y x z
#> 1 3 1 1
#> 2 2 1 3
#> 3 3 2 2
#> 4 3 3 3
#> 5 4 4 6
#> 6 5 6 5
#> 7 6 5 4

Created on 2022-11-20 by the reprex package (v2.0.1)
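
To gauge how many rows are affected before running mice, one option is to count the rows where the target is missing and at least one of its predictors is missing as well. The following is a minimal sketch in base R on the toy data above; the helper name count_blocked is made up for illustration and is not part of the mice package.

```r
# Toy data from the example above
dat <- data.frame(y = c(NA, 2, 3, NA, 4, 5, 6),
                  x = c(1, NA, 2, 3, 4, 6, 5),
                  z = c(1, 3, 2, NA, 6, 5, 4))

# Hypothetical helper (not a mice function): counts rows where `target` is NA
# and at least one of the `predictors` is NA as well.
count_blocked <- function(data, target, predictors) {
  target_missing <- is.na(data[[target]])
  predictor_missing <- !complete.cases(data[, predictors, drop = FALSE])
  sum(target_missing & predictor_missing)
}

blocked <- count_blocked(dat, "y", c("x", "z"))
blocked
#> [1] 1
```

Only the fourth row is counted, which matches the one value that stayed NA in the formula-based run. On the data from the question, the analogous call would presumably be count_blocked(imputation_data, "property_land_use", c("sale_date", "sale_price", "prop_latitude", "prop_longitude", "lot_sqft")). If that count comes out close to 440787, unimputed predictors would explain why the NA count did not change.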
