Home > database >  Conditional rowMeans with dplyr
Conditional rowMeans with dplyr

Time:10-22

I want to apply a conditional rowMeans to each group of IDs using dplyr. The reproducible table follows:

dat <- as.Date("2021/08/04")
len <- 4
seq(dat, by = "day", length.out = len)

input <- data.frame(
  date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
  user_id = c(rep("aa",4),rep("bb",4)),
  var1 = c(1:4),
  var2 = c(4,7,23,9,0,0,0,0),
  var3 = c(0,0,0,0,4,8,2,7) 
)

The logic is: For each ID, first select the columns which are numeric and their sum is =!0, and then compute the rowMean of the selected columns for that ID.

The desired output table follows:

 output = data.frame(
  date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
  user_id = c(rep("aa",4),rep("bb",4)),
  var1 = c(1:4),
  var2 = c(4,7,23,9,0,0,0,0),
  var3 = c(0,0,0,0,4,8,2,7),
  rowAverage = as.numeric(c(rowMeans(input[1:4,3:4]), rowMeans(input[5:8,c(3,5)])))
)

The lines that I came-up with follow, but I get some errors:

output = input %>%
    dplyr::group_by(user_id) %>%
    dplyr::mutate(rowAverage = rowMeans(select_if(function(x) {(is.numeric(x)) && (sum(x)=!0)})))

Could you please help me correct this? Thank you.

CodePudding user response:

You’re very close. You just need to supply the data to select_if(), too. Use cur_data() for that.

input %>%
  group_by(user_id) %>%
  mutate(rowAverage = rowMeans(
    select_if(cur_data(), \(x) is.numeric(x) && sum(x) != 0)
  ))
#> # A tibble: 8 × 6
#> # Groups:   user_id [2]
#>   date       user_id  var1  var2  var3 rowAverage
#>   <date>     <chr>   <int> <dbl> <dbl>      <dbl>
#> 1 2021-08-04 aa          1     4     0        2.5
#> 2 2021-08-05 aa          2     7     0        4.5
#> 3 2021-08-06 aa          3    23     0       13  
#> 4 2021-08-07 aa          4     9     0        6.5
#> 5 2021-08-04 bb          1     0     4        2.5
#> 6 2021-08-05 bb          2     0     8        5  
#> 7 2021-08-06 bb          3     0     2        2.5
#> 8 2021-08-07 bb          4     0     7        5.5

CodePudding user response:

Though the question is tagged dplyr, dplyr is not very good at processing by rows so here is a base R solution.

res <- by(input, input$user_id, \(X) {
  i <- sapply(X, \(x) is.numeric(x) && sum(x, na.rm = TRUE) != 0)
  X$rowAverage <- rowMeans(X[i])
  X
})
res <- do.call(rbind, res)
res
#>            date user_id var1 var2 var3 rowAverage
#> aa.1 2021-08-04      aa    1    4    0        2.5
#> aa.2 2021-08-05      aa    2    7    0        4.5
#> aa.3 2021-08-06      aa    3   23    0       13.0
#> aa.4 2021-08-07      aa    4    9    0        6.5
#> bb.5 2021-08-04      bb    1    0    4        2.5
#> bb.6 2021-08-05      bb    2    0    8        5.0
#> bb.7 2021-08-06      bb    3    0    2        2.5
#> bb.8 2021-08-07      bb    4    0    7        5.5

Created on 2022-10-21 with reprex v2.0.2

  • Related