group_by() and summarise() by row


I have a data set with several row ids per time point and some -Inf values, and I would like to use the R packages dplyr and tidyverse to calculate the average number of -Inf values (i.e., the proportion of columns that are -Inf) per id per time.

This is my data:

dt <- data.frame(id = rep(1:3, each = 4), 
                 time = rep(1:4, times = 3), 
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1), 
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2)) 

In the real data I have more than 100 columns but to simplify I put only x and y.

The expected result:

  id time   n
2  1    2 0.5
3  1    3 0.5
4  1    4 1.0
5  2    1 0.5
6  2    2 0.5
7  2    3 0.5

The idea is to use some specific columns to generate a vector according to a specific calculation function. After searching I found the rowwise() function, but it did not help. Here are my attempts:

dt %>%
  group_by(id,time) %>%
  summarise(n = across(x:y, ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))

dt %>%
  group_by(id,time) %>% 
  rowwise() %>%
  summarise(n = across(everything(), ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))

dt %>%
  rowwise() %>%
  summarise(n = across(everything(), ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))
 
# same results:
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 12 x 3
# Groups:   id [3]
      id  time   n$x    $y
   <int> <int> <dbl> <dbl>
 1     1     1     0     0
 2     1     2     0     0
 3     1     3     0     0
 4     1     4     1     1
 5     2     1     0     0
 6     2     2     1     1
 7     2     3     0     0
 8     2     4     0     0
 9     3     1     0     0
10     3     2     0     0
11     3     3     0     0
12     3     4     0     0

Could you help me to generate this vector n?

CodePudding user response:

I think I understand better what you're aiming to do here. across() isn't needed (it's more for modifying several columns in place). Either rowwise() or group_by() would work:

library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4), 
                 time = rep(1:4, times = 3), 
                 x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1), 
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2)) 

dt %>% 
  group_by(id, time) %>% 
  summarise(n = mean(c(is.infinite(x), is.infinite(y)))) %>% 
  filter(n != 0)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 3
#> # Groups:   id [2]
#>      id  time     n
#>   <int> <int> <dbl>
#> 1     1     2   0.5
#> 2     1     3   0.5
#> 3     1     4   1  
#> 4     2     1   0.5
#> 5     2     2   0.5
#> 6     2     3   0.5
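One caveat: is.infinite() is TRUE for +Inf as well as -Inf. The sample data only contains -Inf, but if positive infinities can occur, comparing against -Inf directly is safer. A sketch on the same data:

```r
library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

res <- dt %>%
  group_by(id, time) %>%
  # == -Inf counts only negative infinities, unlike is.infinite()
  summarise(n = mean(c(x, y) == -Inf), .groups = "drop") %>%
  filter(n != 0)

res
```

Setting `.groups = "drop"` also silences the "summarise() has grouped output" message.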

Here's a possible way of doing the calculation across any number of columns after grouping, using a quick helper function that checks for a negative infinite value:

library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4), 
                 time = rep(1:4, times = 3), 
                 x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1), 
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2),
                 # z is random (no seed set), so your z and n values may differ
                 z = sample(c(1, 2, -Inf), 12, replace = TRUE)) 

is_minus_inf <- function(x) is.infinite(x) & x < 0

dt %>% 
  group_by(id, time) %>% 
  mutate(n = mean(is_minus_inf(c_across(everything()))))
#> # A tibble: 12 × 6
#> # Groups:   id, time [12]
#>       id  time     x     y     z     n
#>    <int> <int> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1     1     2     2 0    
#>  2     1     2     2  -Inf  -Inf 0.667
#>  3     1     3     1  -Inf     2 0.333
#>  4     1     4  -Inf  -Inf     1 0.667
#>  5     2     1     2  -Inf     1 0.333
#>  6     2     2  -Inf     5     2 0.333
#>  7     2     3     1  -Inf  -Inf 0.667
#>  8     2     4     1     2     2 0    
#>  9     3     1     5     1     1 0    
#> 10     3     2     1     2     1 0    
#> 11     3     3     2     2     2 0    
#> 12     3     4     1     2  -Inf 0.333

(Or even simpler, use mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE)), in which case no helper function is needed.)
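A sketch of that simpler variant on the original two-column data. Note that inside c_across(), everything() excludes the grouping columns id and time, so only the value columns are checked:

```r
library(dplyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

res <- dt %>%
  group_by(id, time) %>%
  # each (id, time) group is a single row here, so the group-wise
  # mean over c_across() is the per-row proportion of -Inf values
  mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE)) %>%
  ungroup()

res$n
```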

CodePudding user response:

How about this solution? It gives the desired averages (add a filter(count_neg_inf != 0) step if you only want the non-zero rows) and it scales to any number of columns.
First I "melt" the columns x and y into long format and then summarise over id and time:

dt %>% 
   reshape2::melt(id.vars = c("id", "time")) %>% 
   group_by(id, time) %>% 
   summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE))
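The same reshape can be done with tidyr's pivot_longer() if you prefer to stay within the tidyverse (reshape2 is retired). A sketch, assuming tidyr is installed:

```r
library(dplyr)
library(tidyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

res <- dt %>%
  # stack x and y into one long "value" column
  pivot_longer(cols = -c(id, time), values_to = "value") %>%
  group_by(id, time) %>%
  summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE),
            .groups = "drop")

res
```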

regards,
Samuel
