rowmeans but ignore certain values when calculating the mean but na.rm=F

Time:11-04

Here is a sample dataset:


data <- data.frame(x=c(4,3,4,4,99),
                  y=c(4,NA,3,2,4),
                  z = c(88,NA,4,4,5),
                  w = c(4,5,2,3,4))

I would like to create a new column of row means using rowMeans. I would like to keep na.rm=F, because if a value is truly NA, I want the mean for that row to be NA. But if a value is either 88 or 99, I would like R to ignore it and calculate the mean from the remaining valid values. So far I have the code below.

data$mean <- rowMeans(subset(data, select = c(`x`,`y`,`z`,`w`)), na.rm = T)

But I am not sure how to make it ignore the 88 and 99 values in the calculation.

This is what I am hoping to get

data <- data.frame(x=c(4,3,4,4,99),
                   y=c(4,NA,3,2,4),
                   z = c(88,NA,4,4,5),
                   w = c(4,5,2,3,4),
                   mean=c(4,NA,3.25,3.25,4.3))

Any help is appreciated - thank you!

CodePudding user response:

You can still use rowMeans with na.rm=TRUE, but only on the subset of rows without a true NA, temporarily replacing 88 and 99 with NA.

v <- c("x", "y", "z", "w")  ## col subset to calc. mean
s <- rowSums(is.na(data[v])) == 0  ## row subset: rows without a true NA
data$mean <- NA  ## init column
m <- as.matrix(data[v])  ## we'll need a matrix
data$mean[s] <- rowMeans(replace(m[s, v], m[s, v] %in% c(88, 99), NA), na.rm=TRUE)
data
#    x  y  z w     mean
# 1  4  4 88 4 4.000000
# 2  3 NA NA 5       NA
# 3  4  3  4 2 3.250000
# 4  4  2  4 3 3.250000
# 5 99  4  5 4 4.333333
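The same idea can be wrapped in a small reusable function. The following is only a sketch: the name row_means_skip_codes and its arguments are hypothetical and not part of the answer above. It marks rows with a genuine NA first, then treats the codes as missing, and finally restores NA for the marked rows.

```r
# Hypothetical helper (an illustration, not the answer's original code):
# row means that skip sentinel codes such as 88/99 but return NA for
# any row that contains a genuine NA.
row_means_skip_codes <- function(df, cols, codes = c(88, 99)) {
  m <- as.matrix(df[cols])           # numeric matrix of the chosen columns
  has_na <- rowSums(is.na(m)) > 0    # rows with a genuine NA, recorded first
  m[m %in% codes] <- NA              # treat the sentinel codes as missing
  out <- rowMeans(m, na.rm = TRUE)   # mean of the remaining values
  out[has_na] <- NA                  # a genuine NA makes the whole row NA
  out
}

data <- data.frame(x = c(4, 3, 4, 4, 99),
                   y = c(4, NA, 3, 2, 4),
                   z = c(88, NA, 4, 4, 5),
                   w = c(4, 5, 2, 3, 4))
data$mean <- row_means_skip_codes(data, c("x", "y", "z", "w"))
```

Note that a row consisting only of 88/99 values would yield NaN here, because no valid values remain to average.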

Or simply use apply, but it is much slower.

f <- \(x) if (any(is.na(x))) NA else mean(x[!x %in% c(88, 99)])
cbind(data, mean=apply(data, 1, f))
#    x  y  z w     mean
# 1  4  4 88 4 4.000000
# 2  3 NA NA 5       NA
# 3  4  3  4 2 3.250000
# 4  4  2  4 3 3.250000
# 5 99  4  5 4 4.333333

Timings from microbenchmark:

# Unit: milliseconds
#     expr       min        lq      mean    median        uq        max neval cld
#    apply 35.018162 35.941815 38.834333 36.394632 36.960161 212.469412   100   b
# rowMeans  1.097393  1.119396  1.493563  1.193787  1.226691   9.352118   100  a 
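The benchmark call and the size of the benchmarked data are not shown above. As a rough sketch, the two approaches could be checked for agreement and timed with base R's system.time as follows. The data set big, its size, and the wrapper names via_rowmeans and via_apply are assumptions made for illustration only.

```r
# Hypothetical larger data set; the size and the value pool are assumptions.
set.seed(1)
n <- 10000
vals <- c(1:5, 88, 99, NA)
big <- as.data.frame(matrix(sample(vals, n * 4, replace = TRUE), ncol = 4,
                            dimnames = list(NULL, c("x", "y", "z", "w"))))

# Approach 1: rowMeans on the rows without a true NA, with 88/99 set to NA
via_rowmeans <- function(d) {
  m <- as.matrix(d)
  s <- rowSums(is.na(m)) == 0        # rows without a genuine NA
  out <- rep(NA_real_, nrow(m))
  ms <- m[s, , drop = FALSE]         # keep matrix shape even for one row
  out[s] <- rowMeans(replace(ms, ms %in% c(88, 99), NA), na.rm = TRUE)
  out
}

# Approach 2: apply with a per-row function
via_apply <- function(d) {
  f <- function(x) if (any(is.na(x))) NA else mean(x[!x %in% c(88, 99)])
  unname(apply(d, 1, f))
}

# Both approaches should agree; timings will vary by machine.
same <- isTRUE(all.equal(via_rowmeans(big), via_apply(big)))
t_rowmeans <- system.time(via_rowmeans(big))
t_apply <- system.time(via_apply(big))
```

Printing t_rowmeans and t_apply gives a rough comparison; microbenchmark, as used above, repeats each expression many times and is better suited to careful measurements.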

CodePudding user response:

data <- data.frame(x=c(4,3,4,4,99),
                   y=c(4,NA,3,2,4),
                   z = c(88,NA,4,4,5),
                   w = c(4,5,2,3,4))

data$mean <- apply(data, 1, function(x) {
  idx <- which(!(x %in% c(88, 99)))  # positions of values that are not 88/99
  mean(x[idx], na.rm = FALSE)        # a true NA makes the row mean NA
})

   x  y  z w     mean
1  4  4 88 4 4.000000
2  3 NA NA 5       NA
3  4  3  4 2 3.250000
4  4  2  4 3 3.250000
5 99  4  5 4 4.333333