Home > Blockchain >  Replace NA values in an entire dataframe with a random sample from a vector
Replace NA values in an entire dataframe with a random sample from a vector

Time:01-08

I have a dataframe that's missing values. I want to fill those NA values not with a specific number, but instead randomly from a set of values in another vector.

ID <- c('A', 'B', 'C')
col1 <- c(5, 2, 1)
col2 <- c(8, 1, 6)
col3 <- c(NA, 2, 3)
col4 <- c(NA, 9, NA)
col5 <- c(NA, NA, NA)

replacementVals <- c(.1, .4, .7, .4, .3, .9, .4)

df <- data.frame(ID, col1, col2, col3, col4, col5)
df
  ID col1 col2 col3 col4 col5
1  A    5    8   NA   NA   NA
2  B    2    1    2    9   NA
3  C    1    6    3   NA   NA

I've tried using a combination of is.na() and sample but I haven't gotten anything to work. I know that I could do 2 for-loops indexing on every cell, checking if it's NA, and if it is then sampling 1 value from the list:

for(row in 1:nrow(df)){
  for(col in 2:ncol(df)){
    if(is.na(df[row,col])) df[row,col] <- sample(replacementVals, 1)
  }
}
df
  ID col1 col2 col3 col4 col5
1  A    5    8  0.9  0.4  0.7
2  B    2    1  2.0  9.0  0.4
3  C    1    6  3.0  0.4  0.7

But my actual dataframe is hundreds of thousands of rows and hundreds of columns, and time is a big factor. I'm hoping there's a more efficient way to go about it than brute-forcing it with for loops. Thanks!

CodePudding user response:

Using dplyr

library(dplyr)

df |> 
  mutate(across(, ~ replace(.x, is.na(.x), 
     sample(replacementVals, sum(is.na(.x)), replace = T))))
  ID col1 col2 col3 col4 col5
1  A    5    8  0.9  0.1  0.4
2  B    2    1  2.0  9.0  0.9
3  C    1    6  3.0  0.9  0.4

CodePudding user response:

Here's vectorised base R way -

set.seed(3244)
inds <- is.na(df)
df[inds] <- sample(replacementVals, sum(inds), replace = TRUE)
df

#  ID col1 col2 col3 col4 col5
#1  A    5    8  0.4  0.1  0.3
#2  B    2    1  2.0  9.0  0.4
#3  C    1    6  3.0  0.9  0.9

CodePudding user response:

Loops are not bad necessarily, especially if you are not changing the size of objects in the loop. Nevertheless, I use an apply loop in my answer. You need the replace = TRUE argument in your call to sample if your replacementVals vector is smaller than the number of values that need to be replaced:

set.seed(1111) #  for reproducibility
df[2:6] <- apply(df[2:6], 2, FUN = function(x){
  hit <- which(is.na(x))
  x[hit] <- sample(x = replacementVals, size = length(hit), replace = TRUE)
  return(x)
})
df
#   ID col1 col2 col3 col4 col5
# 1  A    5    8  0.4  0.9  0.4
# 2  B    2    1  2.0  9.0  0.4
# 3  C    1    6  3.0  0.4  0.1
  •  Tags:  
  • r
  • Related