I want to do regression imputation with dplyr in R efficiently. Here is my problem: I have a data set with many missing values in one column, let's call it p. Now I want to estimate the missing values of p with a regression imputation approach. For that, I regress p on a set of variables with OLS using the uncensored data (the subset of the data set without missing values for p). Then I use the estimated coefficients to calculate the missing values of p.
My data set looks like this:
df = data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)
where id represents individuals who buy goods from a group (e.g. "food") with subgroups (like "bread"). p is the price, while var1 and var2 are some demographic variables (like "education" and "age").
What I've done so far:
library(dplyr)
df <- as_tibble(df)
# Create uncensored data
uncensored_df <- df %>%
  filter(!is.na(p))
# Run regression on uncensored data
imp_model <- lm(p ~ var1 + var2, data = uncensored_df)
# Get the coefficients of the fitted model
coefs <- unname(imp_model$coefficients)
# Use coefficients to compute missing values of p
censored_df <- df %>%
  filter(is.na(p)) %>%
  group_by(id, group, sub_group) %>%
  mutate(p = coefs[1] + coefs[2] * var1 + coefs[3] * var2)
# And finally combine the two subsets
bind_rows(uncensored_df, censored_df) %>% arrange(id, group, sub_group)
As I use more than var1 and var2 in my actual problem (about 30 variables), what is a better way to do regression imputation with dplyr? (I'm also open to non-dplyr solutions, though.)
Answer:
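library(dplyr)
# Fit on p and all var* columns; lm() drops rows where p is NA by default
fit <- lm(p ~ ., data = select(df, p, starts_with("var")))
# Predict row by row and keep the observed p where it exists.
# Note: in dplyr 1.1.0 and later, across() without .cols is deprecated;
# pick(everything()) is the suggested replacement there.
df %>%
  rowwise() %>%
  mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
  ungroup()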
How it works
- For starters, when fitting your model, you can subset your data frame using select and any of the tidyselect helpers to pick your predictor variables (here starts_with("var")). This subset data frame then allows you to use the ~ . notation, which means "regress p on everything else in the subset data frame". lm drops rows with a missing p by default, so the model is fit on the uncensored rows only.
- Next you create a row-wise data frame and use your model to predict where p is missing. In this instance across turns each row into a 1x6 tibble that you can pass to the newdata argument. predict then uses the model fit and this new data to predict a value of p.
Output
     id group sub_group     p  var1  var2
  <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>
1     1     1         1  4.3    0.3     0
2     1     1         2  5.7    0.1     0
3     1     2         3  3.60   0.4     0
4     2     1         1  5.10   0.9     1
5     2     1         2 10.7    0.1     1
6     2     2         3 10      0.2     1
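Since the question asks for efficiency, here is a possible vectorized variant of the same idea. It is only a sketch, assuming the example df from the question and that none of the predictor columns contain missing values. predict works on a whole data frame at once, so the row-wise step can be skipped. The names p_hat and imputed are illustrative and do not appear in the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  group = c(1, 1, 2, 1, 1, 2),
  sub_group = c(1, 2, 3, 1, 2, 3),
  p = c(4.3, 5.7, NA, NA, NA, 10),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2),
  var2 = c(0, 0, 0, 1, 1, 1)
)

# Fit on p and all var* columns; rows with missing p are dropped by lm()
fit <- lm(p ~ ., data = select(df, p, starts_with("var")))

# Predict for every row in one call (p_hat is an illustrative name);
# predict() only needs the predictor columns, so the NAs in p do not matter
p_hat <- unname(predict(fit, newdata = df))

# Keep the observed p where available, otherwise use the prediction
imputed <- df %>%
  mutate(p = coalesce(p, p_hat))

print(imputed)
```

With about 30 predictors, the only part that would need to change is the tidyselect helper inside select. On large data this should be noticeably faster than calling predict once per row, although the difference has not been measured here.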