I have a dataframe that's missing values. I want to fill those NA
values not with a specific number, but instead randomly from a set of values in another vector.
ID <- c('A', 'B', 'C')
col1 <- c(5, 2, 1)
col2 <- c(8, 1, 6)
col3 <- c(NA, 2, 3)
col4 <- c(NA, 9, NA)
col5 <- c(NA, NA, NA)
replacementVals <- c(.1, .4, .7, .4, .3, .9, .4)
df <- data.frame(ID, col1, col2, col3, col4, col5)
df
ID col1 col2 col3 col4 col5
1 A 5 8 NA NA NA
2 B 2 1 2 9 NA
3 C 1 6 3 NA NA
I've tried using a combination of is.na()
and sample
but I haven't gotten anything to work. I know that I could do 2 for-loops indexing on every cell, checking if it's NA
, and if it is then sampling 1 value from the list:
for(row in 1:nrow(df)){
for(col in 2:ncol(df)){
if(is.na(df[row,col])) df[row,col] <- sample(replacementVals, 1)
}
}
df
ID col1 col2 col3 col4 col5
1 A 5 8 0.9 0.4 0.7
2 B 2 1 2.0 9.0 0.4
3 C 1 6 3.0 0.4 0.7
But my actual dataframe is hundreds of thousands of rows and hundreds of columns, and time is a big factor. I'm hoping there's a more efficient way to go about it than brute-forcing it with for loops. Thanks!
CodePudding user response:
Using dplyr
library(dplyr)
df |>
mutate(across(, ~ replace(.x, is.na(.x),
sample(replacementVals, sum(is.na(.x)), replace = T))))
ID col1 col2 col3 col4 col5
1 A 5 8 0.9 0.1 0.4
2 B 2 1 2.0 9.0 0.9
3 C 1 6 3.0 0.9 0.4
CodePudding user response:
Here's vectorised base R way -
set.seed(3244)
inds <- is.na(df)
df[inds] <- sample(replacementVals, sum(inds), replace = TRUE)
df
# ID col1 col2 col3 col4 col5
#1 A 5 8 0.4 0.1 0.3
#2 B 2 1 2.0 9.0 0.4
#3 C 1 6 3.0 0.9 0.9
CodePudding user response:
Loops are not bad necessarily, especially if you are not changing the size of objects in the loop. Nevertheless, I use an apply
loop in my answer. You need the replace = TRUE
argument in your call to sample if your replacementVals
vector is smaller than the number of values that need to be replaced:
set.seed(1111) # for reproducibility
df[2:6] <- apply(df[2:6], 2, FUN = function(x){
hit <- which(is.na(x))
x[hit] <- sample(x = replacementVals, size = length(hit), replace = TRUE)
return(x)
})
df
# ID col1 col2 col3 col4 col5
# 1 A 5 8 0.4 0.9 0.4
# 2 B 2 1 2.0 9.0 0.4
# 3 C 1 6 3.0 0.4 0.1