Replace values with a sample not equal to 0

I want to replace the 0s in my dataset, using sample to randomly select a non-zero value from the same column as the replacement.

I have this example dataset:

  Sepal.Length Sepal.Width Petal.Length Petal.Width species
1          0.0         3.5          0.0         0.2  setosa
2          4.9         3.0          0.0         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.0  setosa
5          5.0         0.0          0.0         0.0  setosa
6          0.0         0.0          0.0         0.4  setosa

I have tried:

ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width != 0), ir$Sepal.Width)
[1] 3.5 3.0 3.2 3.1 0.0 0.0 3.4 1.0 2.9 1.0 3.7 1.0 3.0 3.0 4.0

The zeros still remain. Running the code above for each column separately is too time-consuming, so I tried to apply it to all the columns at once:

lapply(ir[,-5], function(x)ifelse(ir[,1:4] == 0, sample(ir[,1:4]),ir[,1:4]))

However, it creates unnecessary columns of data, and the zeros still remain.

Reproducible code:

ir <- structure(list(Sepal.Length = c(0, 4.9, 4.7, 4.6, 5, 0, 4.6, 
5, 4.4, 0, 5.4, 4.8, 0, 0, 0), Sepal.Width = c(3.5, 3, 3.2, 3.1, 
0, 0, 3.4, 0, 2.9, 0, 3.7, 0, 3, 3, 4), Petal.Length = c(0, 0, 
1.3, 1.5, 0, 0, 1.4, 1.5, 1.4, 1.5, 0, 1.6, 1.4, 1.1, 1.2), Petal.Width = c(0.2, 
0.2, 0.2, 0, 0, 0.4, 0.3, 0.2, 0.2, 0, 0.2, 0, 0, 0, 0.2), species = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), row.names = c(NA, 
15L), class = "data.frame")

CodePudding user response：

Here is a short dplyr solution:

library(dplyr)

ir %>% 
  mutate(across(.cols = where(is.numeric), 
                ~ replace(., . == 0, sample(.[. != 0], sum(. == 0), replace = TRUE))))

You may or may not need replace = TRUE, which allows sampled values to be repeated. Without it, sample fails when a column has more zeros than non-zero values.
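
To see why sampling with replacement matters, here is a minimal sketch with a made-up vector x (not taken from the question) in which the zeros outnumber the non-zero values. Without replacement, sample cannot draw more values than the pool holds and stops with an error.

```r
# Hypothetical vector: three zeros, but only two non-zero values to draw from
x <- c(0, 0, 0, 2, 5)
pool <- x[x != 0]
n_zero <- sum(x == 0)

# Without replacement, asking for 3 draws from a pool of 2 is an error;
# tryCatch() turns that error into a marker string for inspection
no_repl <- tryCatch(sample(pool, n_zero), error = function(e) "error")

# With replacement the draw succeeds; values may repeat
x[x == 0] <- sample(pool, n_zero, replace = TRUE)
x
```

If every column is known to have at least as many non-zero values as zeros, sampling without replacement would also work and would avoid repeats.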

CodePudding user response：

A function that replaces zeros with random non-zero values from the same vector:

f <- function(vec){
  
  ind <- vec == 0
  vec[ind] <- sample(vec[!ind], sum(ind), TRUE)
  
  vec
}

Apply the function f to each numeric column:

library(data.table)

# is.numeric also catches integer columns and avoids comparing a list to a string
num_cols <- names(df)[sapply(df, is.numeric)]
setDT(df)[, (num_cols) := lapply(.SD, f), .SDcols = num_cols]

Or, using base R:

num_cols <- names(df)[sapply(df, is.numeric)]
df[num_cols] <- lapply(df[num_cols], f)

Note:

It would be better to use this sample function from the book Advanced R:

# Masks base::sample. %||% is in base R from 4.4.0; on older versions, load rlang for it
sample <- function(x, size = NULL, replace = FALSE, prob = NULL) {
  
  size <- size %||% length(x)
  x[sample.int(length(x), size, replace = replace, prob = prob)]
}

because base::sample, when x is a single number n >= 1, samples from 1:n instead of from x itself.
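
The length-one behavior can be seen in a small sketch. The vector x and the helper name safe_sample are illustrative assumptions, not part of the answer above. Indexing with sample.int draws positions rather than values, so a pool of length one is handled as expected.

```r
# Hypothetical column with a single non-zero value
x <- c(0, 0, 7)
pool <- x[x != 0]          # length 1: just 7

set.seed(1)
# base::sample() treats a single number n as 1:n, so these draws come from 1:7
bad <- base::sample(pool, 20, replace = TRUE)

# Illustrative helper: sample positions, then index, so a length-1 pool is safe
safe_sample <- function(pool, size) {
  pool[sample.int(length(pool), size, replace = TRUE)]
}
good <- safe_sample(pool, 20)

range(bad)    # lies within 1..7, not fixed at 7
unique(good)  # only the value actually present in the pool
```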

CodePudding user response：

Using data.table (library(data.table)):

setDT(ir)
ir[, Sepal.Width := 
       ifelse(Sepal.Width==0, 
              sample(Sepal.Width[Sepal.Width!=0], .N, replace=TRUE), 
              Sepal.Width)]

You could also have it sample from within the same species by adding a by:

setDT(ir)
ir[, Sepal.Width := 
       ifelse(Sepal.Width==0, 
              sample(Sepal.Width[Sepal.Width!=0], .N, replace=TRUE), 
              Sepal.Width), 
   by=species]

To do this for all columns:

ir[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") := 
       lapply(.SD, function(x) {
         ifelse(x==0, sample(x[x!=0], size=.N, replace=TRUE), x)}), 
   by=species]
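
For comparison, here is a base R sketch of the same within-group idea using ave. The small data frame d, its column names, and the helper fill_zeros are made up for illustration. The helper indexes with sample.int so that a group with a single non-zero value is handled safely; it assumes every group has at least one non-zero value.

```r
# Hypothetical data: two groups with clearly separated value ranges
d <- data.frame(
  value = c(0, 1.1, 1.3, 0, 9.2, 0, 9.8, 9.5),
  group = c("a", "a", "a", "a", "b", "b", "b", "b")
)

# Replace zeros in x with draws (with replacement) from x's non-zero values
fill_zeros <- function(x) {
  zero <- x == 0
  pool <- x[!zero]
  x[zero] <- pool[sample.int(length(pool), sum(zero), replace = TRUE)]
  x
}

# ave() applies fill_zeros() separately within each group
d$value <- ave(d$value, d$group, FUN = fill_zeros)
d
```

Because the two groups have non-overlapping ranges here, it is easy to confirm by eye that each zero was filled from its own group.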

Note that your code

ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width != 0), ir$Sepal.Width)

is sampling from TRUE and FALSE values, because the logical comparison ir$Sepal.Width != 0 is passed to sample directly instead of being used to subset the column. You need:

ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width[ir$Sepal.Width != 0]), ir$Sepal.Width)
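
The difference can be checked on a short made-up vector (the name w is illustrative). Sampling w != 0 only shuffles logical flags, which ifelse then coerces to 1 and 0, whereas subsetting with it first yields the actual non-zero values.

```r
w <- c(3.5, 0, 3.2, 0, 2.9)

# Shuffles a logical vector; no measurements are involved
shuffled_flags <- sample(w != 0)
class(shuffled_flags)            # "logical"

# Subsetting first gives the pool of real non-zero values
pool <- w[w != 0]

# ifelse() recycles the shuffled pool to the length of the test,
# so each zero position receives some non-zero value from the pool
fixed <- ifelse(w == 0, sample(pool), w)
fixed
```

Note that this form relies on recycling and can still misbehave when only one non-zero value exists, since sample then treats that single number n as 1:n.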