I want to replace the 0s in my dataset, using sample to randomly select a value from the same column as the replacement.
I have this example dataset:
Sepal.Length Sepal.Width Petal.Length Petal.Width species
1 0.0 3.5 0.0 0.2 setosa
2 4.9 3.0 0.0 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.0 setosa
5 5.0 0.0 0.0 0.0 setosa
6 0.0 0.0 0.0 0.4 setosa
I have tried:
ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width != 0), ir$Sepal.Width)
[1] 3.5 3.0 3.2 3.1 0.0 0.0 3.4 1.0 2.9 1.0 3.7 1.0 3.0 3.0 4.0
The zeros still remain. Running the code above for each column separately is too time-consuming, so I also tried to apply it to all the columns at once:
lapply(ir[,-5], function(x)ifelse(ir[,1:4] == 0, sample(ir[,1:4]),ir[,1:4]))
However, it creates unnecessary columns of data, and the zeros still remain.
Reproducible code:
ir <- structure(list(Sepal.Length = c(0, 4.9, 4.7, 4.6, 5, 0, 4.6,
5, 4.4, 0, 5.4, 4.8, 0, 0, 0), Sepal.Width = c(3.5, 3, 3.2, 3.1,
0, 0, 3.4, 0, 2.9, 0, 3.7, 0, 3, 3, 4), Petal.Length = c(0, 0,
1.3, 1.5, 0, 0, 1.4, 1.5, 1.4, 1.5, 0, 1.6, 1.4, 1.1, 1.2), Petal.Width = c(0.2,
0.2, 0.2, 0, 0, 0.4, 0.3, 0.2, 0.2, 0, 0.2, 0, 0, 0, 0.2), species = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
15L), class = "data.frame")
CodePudding user response:
Here is a short dplyr solution:
library(dplyr)
# assign the result (ir <- ir %>% ...) if you want to keep it
ir %>%
  mutate(across(.cols = where(is.numeric),
                .fns = ~ replace(., . == 0, sample(.[. != 0], sum(. == 0), replace = TRUE))))
You may or may not need replace = TRUE, which allows sampled values to be repeated.
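To illustrate when replace = TRUE matters, here is a minimal base R sketch. The vector x is made up for the illustration (it is not from the question's data) and has more zeros than non-zero values, so a draw without replacement cannot supply enough values.

```r
# Hypothetical vector: three zeros but only two non-zero values to draw from.
x <- c(0, 0, 0, 1.2, 3.4)
pool <- x[x != 0]
n_zero <- sum(x == 0)

# Without replacement the draw fails: 3 values requested from a pool of 2.
no_replace <- tryCatch(sample(pool, n_zero), error = function(e) "error")

# With replacement the same call succeeds and may repeat values.
set.seed(1)  # only to make the sketch reproducible
filled <- replace(x, x == 0, sample(pool, n_zero, replace = TRUE))
```

If every column has at least as many non-zero values as zeros, either setting works; replace = TRUE is simply the safer default.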
CodePudding user response:
A function that replaces the zeros in a vector with random non-zero values from the same vector:
f <- function(vec){
ind <- vec == 0
vec[ind] <- sample(vec[!ind], sum(ind), TRUE)
vec
}
Apply the function f to each numeric column (df is your data frame, ir in the question):
library(data.table)
# is.numeric also catches integer columns, unlike comparing class() to "numeric"
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
setDT(df)[, (num_cols) := lapply(.SD, f), .SDcols = num_cols]
Or, using base R:
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
df[num_cols] <- lapply(df[num_cols], f)
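As a quick check of the base R approach, the sketch below runs it end to end. The data frame df and its column names are invented for the illustration; f is the helper defined earlier in this answer, repeated so the snippet runs on its own.

```r
# Helper from this answer: replace zeros with random non-zero values of the same vector.
f <- function(vec) {
  ind <- vec == 0
  vec[ind] <- sample(vec[!ind], sum(ind), TRUE)
  vec
}

# Small made-up data frame standing in for the real data.
df <- data.frame(
  a = c(0, 4.9, 4.7, 0),
  b = c(3.5, 0, 3.2, 3.1),
  g = factor(c("x", "x", "y", "y"))
)

num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
set.seed(42)  # only to make the sketch reproducible
df[num_cols] <- lapply(df[num_cols], f)

# No zeros should remain in the numeric columns; the factor column is untouched.
any_zero <- any(df[num_cols] == 0)
```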
Note: it would be better to use this sample function from the book Advanced R:
# %||% is in base R from 4.4.0; on older versions it is available from rlang
sample <- function(x, size = NULL, replace = FALSE, prob = NULL) {
  size <- size %||% length(x)
  x[sample.int(length(x), size, replace = replace, prob = prob)]
}
because base::sample behaves differently when x is a numeric vector of length 1: it then samples from 1:x instead of from x itself.
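The length-1 behavior can be seen in a small sketch. The name safe_sample is hypothetical; it is the same index-based idea as the function above, written without %||% so that it runs on any R version.

```r
# base::sample() treats a single number n >= 1 as "sample from 1:n".
set.seed(1)  # only to make the sketch reproducible
surprising <- sample(5, 3, replace = TRUE)  # values drawn from 1:5, not from c(5)

# Index-based wrapper: always samples from the elements of x, whatever its length.
safe_sample <- function(x, size = length(x), replace = FALSE, prob = NULL) {
  x[sample.int(length(x), size, replace = replace, prob = prob)]
}
expected <- safe_sample(5, 3, replace = TRUE)  # always 5 5 5
```

This matters here when a column has exactly one non-zero value: f would then fill the zeros with values from 1:x rather than with that value.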
CodePudding user response:
Using data.table (library(data.table)):
setDT(ir)
ir[, Sepal.Width :=
     ifelse(Sepal.Width==0,
            sample(Sepal.Width[Sepal.Width!=0], .N, replace=TRUE),
            Sepal.Width)]
You could also have it sample from within the same species by adding by=species:
setDT(ir)
ir[, Sepal.Width :=
ifelse(Sepal.Width==0,
sample(Sepal.Width[Sepal.Width!=0], .N, replace=TRUE),
Sepal.Width),
by=species]
Doing this for all columns:
ir[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") :=
lapply(.SD, function(x) {
ifelse(x==0, sample(x[x!=0], size=.N, replace=TRUE), x)}),
by=species]
Note that your code
ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width != 0), ir$Sepal.Width)
is sampling from TRUE and FALSE values, because ir$Sepal.Width != 0 is only a logical comparison and you are not using it to subset the column. You need:
ifelse(ir$Sepal.Width == 0, sample(ir$Sepal.Width[ir$Sepal.Width != 0]), ir$Sepal.Width)
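The difference can be checked on a small vector. This is a sketch with made-up values in x; it also shows where the 1.0 entries in the question's output come from, since ifelse coerces TRUE to 1 and FALSE to 0.

```r
x <- c(3.5, 3.0, 0, 0, 3.4)

# The comparison alone yields a logical vector, so sample() shuffles TRUE/FALSE.
shuffled_flags <- sample(x != 0)
attempt <- ifelse(x == 0, shuffled_flags, x)  # zeros become 0 or 1 (FALSE/TRUE coerced)

# Subsetting first gives real values to draw from.
pool <- x[x != 0]
set.seed(1)  # only to make the sketch reproducible
fixed <- ifelse(x == 0, sample(pool, length(x), replace = TRUE), x)
```

Drawing length(x) values with replace = TRUE gives ifelse one candidate per position, so nothing depends on recycling a shorter vector.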