R - Partial mean imputation before multiple imputation

Time:12-10

I have a 100 row, 200 variable dataset (mostly Likert-scale survey data as factors, but also some numeric data). I have attached a reprex with similar patterns of missing data below; some blank cells are sporadic, but for some rows, an entire scale is blank.

I wanted to use the mice package's multiple imputation before subscale calculation, but multiple imputation of a factor-heavy dataset is impractical. However, I need to address the sporadically-missing cells before calculating subscale totals because if one item in the scale is NA, then the subscale total will also be NA.

My goal is to conduct single/mean imputation on the sporadically-missing factor cells, then calculate subscale totals, and then conduct multiple imputation on the numeric subscale total variables, which will target the remaining large gaps.

I would like the sporadically-missing values to be imputed with the mean of the individual's remaining items on the same scale. E.g., if an individual is missing df$c2, then that missing value would be imputed with the mean of c1, c3, c4, and c5.
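As a small illustration of the intended behaviour, here is the arithmetic for a single person. The response values are made up for the example and are not taken from the dataset.

```r
# Hypothetical responses for one person; c2 is the sporadically missing item.
scale_c <- c(c1 = 3, c2 = NA, c3 = 2, c4 = 1, c5 = 2)

# Mean of the person's remaining items on the same scale: (3 + 2 + 1 + 2) / 4.
person_mean <- mean(scale_c, na.rm = TRUE)

# Replace only the missing item with that mean.
scale_c[is.na(scale_c)] <- person_mean
scale_c
```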

How can I conduct single/mean imputation on a subset of data, dependent on the patterns of missingness within each row?

library(reprex)
library(tidyverse)
library(missMethods)
library(finalfit)
library(mice)

set.seed(1234)

a1 <- sample(1:3, 2000, replace=TRUE)
a2 <- sample(1:3, 2000, replace=TRUE)
a3 <- sample(1:3, 2000, replace=TRUE)
a4 <- sample(1:3, 2000, replace=TRUE)
a5 <- sample(1:3, 2000, replace=TRUE)

b1 <- sample(1:3, 2000, replace=TRUE)
b2 <- sample(1:3, 2000, replace=TRUE)
b3 <- sample(1:3, 2000, replace=TRUE)
b4 <- sample(1:3, 2000, replace=TRUE)
b5 <- sample(1:3, 2000, replace=TRUE)

c1 <- sample(1:3, 2000, replace=TRUE)
c2 <- sample(1:3, 2000, replace=TRUE)
c3 <- sample(1:3, 2000, replace=TRUE)
c4 <- sample(1:3, 2000, replace=TRUE)
c5 <- sample(1:3, 2000, replace=TRUE)

d1 <- sample(1:3, 2000, replace=TRUE)
d2 <- sample(1:3, 2000, replace=TRUE)
d3 <- sample(1:3, 2000, replace=TRUE)
d4 <- sample(1:3, 2000, replace=TRUE)
d5 <- sample(1:3, 2000, replace=TRUE)

years <- sample(18:70, 2000, replace=TRUE)
gender <- sample(c("male","female"), 2000, replace=TRUE, prob=c(0.5, 0.5))
education <- sample(c("highschool","college", "gradschool"), 2000, replace=TRUE, prob=c(1/3, 1/3, 1/3))
height <- sample(60:75, 2000, replace=TRUE)

df <- data.frame(a1, a2, a3, a4, a5,
                 b1, b2, b3, b4, b5,
                 c1, c2, c3, c4, c5,
                 d1, d2, d3, d4, d5,
                 years, gender, education, height)

facts <- df %>% select(contains("gender") | contains("education")) %>% colnames()
cols <- df %>% select(ends_with("1") | 
                        ends_with("2") | 
                        ends_with("3") |
                        ends_with("4") | 
                        ends_with("5")) %>% colnames()

df <- delete_MCAR(df, p = 0.01, cols_mis = cols) %>%
  dplyr::mutate(across(all_of(facts), factor))

df[c(200:300, 500:550, 900:1000), 1:5] <- NA
df[c(400:500, 600:650, 1100:1200), 6:10] <- NA
df[c(10:100, 300:450, 1500:1650), 11:15] <- NA
df[c(300:400, 700:800, 1700:1900), 16:20] <- NA

## I think mean imputation of the sporadically-missing cells would occur here

missing_plot(df)

df <- df %>%
  rowwise() %>%
  mutate(a_mean = mean(c(a1, a2, a3, a4, a5))) %>%
  mutate(b_mean = mean(c(b1, b2, b3, b4, b5))) %>%
  mutate(c_mean = mean(c(c1, c2, c3, c4, c5))) %>%
  mutate(d_mean = mean(c(d1, d2, d3, d4, d5)))

df <- df %>%
  select(ends_with("mean") | contains("years") | contains("gender") | contains("education") | contains("height"))
                       
imp_df <- mice::mice(df, m = 5, print = FALSE)
com <- mice::complete(imp_df)


CodePudding user response:

Am I understanding correctly that you just want to take the mean for each of a1 ... d5 individually, only where values are sporadically missing? If so, we can define a function, na_search(), that scans each column and identifies runs of consecutive NA values that are shorter than a given length.
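The definition of na_search() is not shown in this post, so the following is only a sketch of what such a function could look like, assuming it takes a column and a threshold and returns TRUE for NAs that sit in a run of consecutive NAs shorter than that threshold. It uses base R's rle() and is a guess at the behaviour described above, not the original code.

```r
# Assumed reconstruction of na_search(): flag NAs that belong to a run of
# consecutive NAs shorter than `threshold`. NAs in longer runs (the large
# blank blocks) and all observed values get FALSE.
na_search <- function(x, threshold) {
  runs <- rle(is.na(x))
  short_na_run <- runs$values & runs$lengths < threshold
  rep(short_na_run, times = runs$lengths)
}

# Made-up column: one isolated NA, a block of five NAs, and a trailing pair.
x <- c(1, NA, 2, NA, NA, NA, NA, NA, 3, NA, NA)
flags <- na_search(x, threshold = 5)
flags
```

With threshold = 5, the isolated NA and the trailing pair are flagged, while the block of five is left for multiple imputation. Note that this treats "sporadic" as a property of consecutive rows within a column, which works for the reprex because the large gaps were created as contiguous row ranges.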

Infill with mean of other scale variables

If we want to infill with the mean of the other scale variables, we take a similar approach. Just a couple of notes:

  • We need to remove missing values when computing the rowwise() scale means; otherwise the mean is NA wherever an item is missing, and there is nothing to impute with.
  • We should always ungroup() after using rowwise(); otherwise it's easy to forget the data is still grouped by row and get strange results later.
  • We'll use across() together with cur_column() to get the current column's name, take its first letter, and look up the matching x_mean column in .data.
  • We define an infill_fun() that wraps na_search() to do the above.

# Per-row scale means, ignoring missing items.
# Rows where a whole scale is blank get NaN, which is.na() treats as missing.
df <- df %>%
  rowwise() %>%
  mutate(a_mean = mean(c(a1, a2, a3, a4, a5), na.rm = TRUE)) %>%
  mutate(b_mean = mean(c(b1, b2, b3, b4, b5), na.rm = TRUE)) %>%
  mutate(c_mean = mean(c(c1, c2, c3, c4, c5), na.rm = TRUE)) %>%
  mutate(d_mean = mean(c(d1, d2, d3, d4, d5), na.rm = TRUE)) %>%
  ungroup()

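# Replace sporadic NAs in column x with the row's scale mean.
# na_search() must already be defined; it flags the NAs to infill.
infill_fun <- function(x, threshold, df, cur_col) {
  # "c2" -> "c_mean": the first letter of the item name identifies its scale.
  mean_col <- paste0(substr(cur_col, 1, 1), "_mean")
  ifelse(na_search(x, threshold),
         df[[mean_col]],
         x)
}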

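# Apply the infill to the item columns only (a1 ... d5), then re-plot missingness.
# Assign the result of mutate() back to df if you want to keep the infilled values.
df %>%
  mutate(across(
    matches("^[a-d][1-5]$"),
    ~ infill_fun(
      .x,
      threshold = 5,
      df = .data,
      cur_col = cur_column()
    )
  )) %>%
  missing_plot()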
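
Since the question asks for imputation that depends on the pattern of missingness within each row, a different way to decide what counts as "sporadic" is to count the missing items per row within a scale, rather than looking for runs down a column. The following is a minimal base R sketch of that alternative, not the method above. The helper name impute_person_mean() and the max_missing cutoff are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: fill missing items with the row's own scale mean,
# but only in rows where at most `max_missing` items of that scale are missing.
impute_person_mean <- function(df, items, max_missing = 1) {
  block <- as.matrix(df[items])
  n_missing <- rowSums(is.na(block))
  row_mean <- rowMeans(block, na.rm = TRUE)

  # Rows with a few gaps qualify; fully or mostly blank rows are left as NA.
  fill_row <- n_missing > 0 & n_missing <= max_missing
  fill_cell <- is.na(block) & fill_row  # fill_row recycles down each column

  block[fill_cell] <- row_mean[row(block)[fill_cell]]
  df[items] <- as.data.frame(block)
  df
}

# Made-up data: row 1 has one gap, row 2 is entirely blank, row 3 is complete.
toy <- data.frame(
  c1 = c(3, NA, 1),
  c2 = c(NA, NA, 2),
  c3 = c(2, NA, 3),
  c4 = c(1, NA, 1),
  c5 = c(2, NA, 3)
)
result <- impute_person_mean(toy, items = paste0("c", 1:5))
result
```

Here only row 1 is filled; the blank row stays NA so that its subscale total can be handled by mice afterwards. The function would be called once per scale, e.g. with paste0("a", 1:5), and the cutoff should be chosen to suit the scale.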
