Regression imputation with dplyr in R

Time:10-31

I want to do regression imputation with dplyr in R efficiently. Here is my problem: I have a data set with many missing values for one column - let's call it p. Now I want to estimate the missing values of p with a regression imputation approach. For that I regress p on a set of variables with OLS using uncensored data (a subset of the data set without missing values for p). Then I use the estimated coefficients to calculate the missing values of p.

My data set looks like this:

df = data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

where id identifies individuals, who buy goods from a group (e.g. "food") with subgroups (like "bread"). p is the price, while var1 and var2 are demographic variables (like "education" and "age").

What I've done so far:

library(dplyr)

df <- as_tibble(df)

# Create uncensored data
uncensored_df <- df %>%
  filter(!is.na(p))

# Run regression on uncensored data
imp_model <- lm(p ~ var1 + var2, data = uncensored_df)

# Get the coefficients of the fitted model
coefs <- unname(imp_model$coefficients)

# Use coefficients to compute missing values of p
censored_df <- df %>%
  filter(is.na(p)) %>%
  group_by(id, group, sub_group) %>%
  mutate(p = coefs[1] + coefs[2] * var1 + coefs[3] * var2)

# And finally combine the two subsets
bind_rows(uncensored_df, censored_df) %>% arrange(id, group, sub_group)

My actual problem uses about 30 variables rather than just var1 and var2, so writing out every coefficient by hand does not scale. What is a better way to do regression imputation with dplyr? (I'm also open to non-dplyr solutions.)

Answer:

library(dplyr)

fit <- lm(p ~ ., data = select(df, p, starts_with("var")))


df %>% 
  rowwise() %>% 
  mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>% 
  ungroup()

How it works

  • When fitting the model, you can subset the data frame with select and any of the tidyselect helpers to pick the predictor (independent) variables (here starts_with("var")) together with the response p. Working on this subset lets you use the ~ . notation, which means "regress p on every other column in the data". By default lm drops the rows where p is NA, so there is no need to filter them out first.
  • Next, create a row-wise data frame and use the model to predict p wherever it is missing. Here across turns each row into a 1x6 tibble that can be passed to the newdata argument. predict then uses the fitted model and this new data to compute a value of p.
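
As a possible variation on the steps above, the row-wise pass can be avoided, since predict accepts the whole data frame at once. The following is a minimal sketch, not the answerer's code: it assumes the same df as in the question, and the name imputed is introduced here purely for illustration. It may also be useful on recent dplyr versions, where calling across() with no arguments may emit a deprecation warning.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

# Fit on p and the var* columns; lm drops rows with missing p by default
fit <- lm(p ~ ., data = select(df, p, starts_with("var")))

# Predict for every row in one call, then keep the observed p where it exists.
# `imputed` is a hypothetical name used only in this sketch.
imputed <- df %>%
  mutate(p = coalesce(p, unname(predict(fit, newdata = df))))
```

coalesce takes the first non-missing value element by element, so observed prices are left untouched and only the NA entries are filled with fitted values. With 30 predictors the only thing that changes is the tidyselect expression inside select.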

Output

     id group sub_group     p  var1  var2
  <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>
1     1     1         1  4.3    0.3     0
2     1     1         2  5.7    0.1     0
3     1     2         3  3.60   0.4     0
4     2     1         1  5.10   0.9     1
5     2     1         2 10.7    0.1     1
6     2     2         3 10      0.2     1
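
To see where the imputed values come from, the fitted coefficients can be inspected directly. This is a small base R sketch on the same example data (the name imputed_row3 is hypothetical). Note that the toy data has only three rows with an observed p and the model estimates three coefficients, so the fit has zero residual degrees of freedom and reproduces the observed prices exactly; the numbers are illustrative rather than a meaningful estimate.

```r
# Same example data as in the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

# Rows with missing p are dropped automatically when fitting
fit <- lm(p ~ var1 + var2, data = df)

coef(fit)         # approximately: (Intercept) 6.4, var1 -7, var2 5
df.residual(fit)  # 0: three observations, three coefficients

# Row 3 has var1 = 0.4 and var2 = 0, so its imputed price is
# 6.4 - 7 * 0.4 + 5 * 0, i.e. about 3.6
imputed_row3 <- unname(predict(fit, newdata = df[3, ]))
```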