Regression imputation with dplyr in R

I want to do regression imputation with dplyr in R efficiently. Here is my problem: I have a data set with many missing values for one column - let's call it p. Now I want to estimate the missing values of p with a regression imputation approach. For that I regress p on a set of variables with OLS using uncensored data (a subset of the data set without missing values for p). Then I use the estimated coefficients to calculate the missing values of p.

My data set looks like this:

df = data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

where id identifies individuals who buy goods from a group (e.g. "food") with sub-groups (e.g. "bread"). p is the price, while var1 and var2 are demographic variables (e.g. "education" and "age").

What I've done so far:

library(dplyr)

df <- as_tibble(df)

# Create uncensored data
uncensored_df <- df %>%
  filter(!is.na(p))

# Run regression on uncensored data
imp_model <- lm(p ~ var1 + var2, data = uncensored_df)

# Get the coefficients of the fitted model
coefs <- unname(imp_model$coefficients)

# Use coefficients to compute missing values of p
censored_df <- df %>%
  filter(is.na(p)) %>%
  group_by(id, group, sub_group) %>%
  mutate(p = coefs[1] + coefs[2] * var1 + coefs[3] * var2)

# And finally combine the two subsets                                 
bind_rows(uncensored_df, censored_df) %>% arrange(id, group, sub_group)

Since my actual problem involves about 30 variables rather than just var1 and var2, what is a better way to do regression imputation with dplyr? (I'm also open to non-dplyr solutions.)

Answer:

library(dplyr)

fit <- lm(p ~ ., data = select(df, p, starts_with("var")))


df %>% 
  rowwise() %>% 
  mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>% 
  ungroup()

How it works

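For starters, when fitting your model, you can subset your data frame using select and any of the tidyselect helpers to pick out p together with your independent variables (here, starts_with("var")). This subset data frame then allows you to use the ~ . notation, which means "regress p on every other column in the subset data frame". Rows where p is NA do not need to be filtered out first: under R's default na.action setting (na.omit), lm drops them when fitting.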
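If the predictors do not share a convenient prefix, one possible alternative to ~ . is to build the formula from a character vector of column names with base R's reformulate. The following is only a sketch using the df from the question; the names predictors, f, fit_dot and fit_explicit are illustrative and not part of the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

# Collect the predictor names (hypothetical helper variable)
predictors <- names(select(df, starts_with("var")))

# Build the formula p ~ var1 + var2 from the names
f <- reformulate(predictors, response = "p")

# Fit once with the dot notation and once with the explicit formula
fit_dot <- lm(p ~ ., data = select(df, p, starts_with("var")))
fit_explicit <- lm(f, data = df)

# Compare the estimated coefficients of the two fits
all.equal(coef(fit_dot), coef(fit_explicit))
```

Both fits use the same predictors, so the comparison is a quick way to confirm that the subset passed to ~ . contains exactly the columns you intended.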
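Next, you create a row-wise data frame and use your model to predict p where it is missing. In this instance, across() with no arguments turns each row into a 1x6 tibble that you can pass to the newdata argument. predict then uses the fitted model and this new data to predict a value of p. Note that in dplyr 1.1.0 and later, pick(everything()) is the recommended replacement for an empty across() call.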
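Because predict is vectorised over the rows of newdata, the row-wise step is not strictly required. Below is a minimal sketch of a vectorised variant, assuming the same df and fit as in the answer; the names pred and imputed are illustrative. With about 30 predictors and many rows this may be noticeably faster than predicting one row at a time, though that has not been benchmarked here.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

# Same model as in the answer; rows with missing p are dropped by lm
fit <- lm(p ~ ., data = select(df, p, starts_with("var")))

imputed <- df %>%
  mutate(
    # Predict for all rows in one call (temporary helper column)
    pred = unname(predict(fit, newdata = df)),
    # Keep observed prices, fill in predictions only where p is missing
    p = ifelse(is.na(p), pred, p)
  ) %>%
  select(-pred)

imputed
```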

Output

     id group sub_group     p  var1  var2
  <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>
1     1     1         1  4.3    0.3     0
2     1     1         2  5.7    0.1     0
3     1     2         3  3.60   0.4     0
4     2     1         1  5.10   0.9     1
5     2     1         2 10.7    0.1     1
6     2     2         3 10      0.2     1
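Since the question also allows non-dplyr solutions, here is a sketch of the same imputation in base R. It assumes the df from the question; predictor_cols, miss and imputed are illustrative names. On the example data it should give the same imputed prices as the output above.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

# Select predictor columns by name pattern (assumes they start with "var")
predictor_cols <- grep("^var", names(df), value = TRUE)

# Fit the imputation model; rows with missing p are dropped by lm
fit <- lm(reformulate(predictor_cols, response = "p"), data = df)

# Logical index of the rows that need imputing
miss <- is.na(df$p)

# Copy the data and overwrite only the missing prices with predictions
imputed <- df
imputed$p[miss] <- predict(fit, newdata = df[miss, ])

imputed
```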