Working through survey data and I need to perform some imputing for occasional NA's popping up within the different questions of a measure. I want to replace NA with the row mean, but only on the condition that there are no more than 2 NA's across the row. Any tips on how to achieve this would be amazing.
I've tried using the code below (with some example data), but that doesn't allow me to control for how many NA's in a row that are acceptable.
data <- data.frame(
var_1 = c(2,3,NA,2,3,5,NA,3),
var_2 = c(3,4,2,3,1,3,NA,2),
var_3 = c(NA,3,2,5,4,2,NA,2),
var_4 = c(NA,3,NA,4,1,2,NA,1),
var_5 = c(NA,4,2,3,2,3,NA,2),
var_6 = c(4,2,1,NA,2,5,NA,3),
var_7 = c(3,2,1,2,2,4,NA,3))
data_fix <- data %>%
mutate(var_1 = ifelse(is.na(var_1),rowMeans(data[row_number(),], na.rm = T),var_1),
var_2 = ifelse(is.na(var_2),rowMeans(data[row_number(),], na.rm = T),var_2),
var_3 = ifelse(is.na(var_3),rowMeans(data[row_number(),], na.rm = T),var_3),
var_4 = ifelse(is.na(var_4),rowMeans(data[row_number(),], na.rm = T),var_4),
var_5 = ifelse(is.na(var_5),rowMeans(data[row_number(),], na.rm = T),var_5),
var_6 = ifelse(is.na(var_6),rowMeans(data[row_number(),], na.rm = T),var_6),
var_7 = ifelse(is.na(var_7),rowMeans(data[row_number(),], na.rm = T),var_7))
CodePudding user response:
A dplyr
solution:
library(dplyr)
data %>%
mutate(na.count = rowSums(is.na(data)),
row.mean = rowMeans(data, na.rm = TRUE),
across(var_1:var_7, ~ ifelse(is.na(.x) & na.count <= 2, row.mean, .x))) %>%
select(-c(na.count, row.mean))
var_1 var_2 var_3 var_4 var_5 var_6 var_7
1 2.0 3 NA NA NA 4.000000 3
2 3.0 4 3 3.0 4 2.000000 2
3 1.6 2 2 1.6 2 1.000000 1
4 2.0 3 5 4.0 3 3.166667 2
5 3.0 1 4 1.0 2 2.000000 2
6 5.0 3 2 2.0 3 5.000000 4
7 NA NA NA NA NA NA NA
8 3.0 2 2 1.0 2 3.000000 3
Its base
equivalent is:
na.count <- rowSums(is.na(data))
row.mean <- rowMeans(data, na.rm = TRUE)
data2 <- data
data2[] <- lapply(data, \(x) ifelse(is.na(x) & na.count <= 2, row.mean, x))
data2
CodePudding user response:
Here's an option:
rmeans <- rowMeans(data, na.rm = TRUE)
rmeans[rowSums(is.na(data)) > 2] <- NA
data2 <- data
data2[] <- rmeans
data[is.na(data)] <- data2[is.na(data)]
data
#> var_1 var_2 var_3 var_4 var_5 var_6 var_7
#> 1 2.0 3 NA NA NA 4.000000 3
#> 2 3.0 4 3 3.0 4 2.000000 2
#> 3 1.6 2 2 1.6 2 1.000000 1
#> 4 2.0 3 5 4.0 3 3.166667 2
#> 5 3.0 1 4 1.0 2 2.000000 2
#> 6 5.0 3 2 2.0 3 5.000000 4
#> 7 NA NA NA NA NA NA NA
#> 8 3.0 2 2 1.0 2 3.000000 3
Created on 2022-04-20 by the reprex package (v2.0.1)