Home > Enterprise >  How to consider the bigger date inside groups after summarize
How to consider the bigger date inside groups after summarize

Time:11-30

I'm taking the mean, 3 by 3, by grouping. For that, I'm using the summarise function. In this context I would like to select the last date from the four that make up the average.

I tried to select the maximum, but this way I'm just selecting the highest date for the whole group.

test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C",  "A", "A", "A"),
                  measure = c(10, 20, 5, 2, 62 ,2, 5, 4, 6, 7, 25),
                  time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021", 
                                      "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"))
# > test
#    my_groups measure       time
# 1          A      10 20-09-2020
# 2          A      20 25-09-2020
# 3          A       5 19-09-2020
# 4          B       2 20-05-2020
# 5          B      62 20-06-2021
# 6          C       2 11-01-2021
# 7          C       5 13-01-2021
# 8          C       4 13-01-2021
# 9          A       6 15-01-2021
# 10         A       7 15-01-2021
# 11         A      25 19-01-2021

test %>%
  arrange(time) %>%
  group_by(my_groups) %>%
  summarise(mean_3 = rollapply(measure, 3, mean, by = 3, align = "left", partial = F),
            final_data = max(time))

# my_groups mean_3 final_data
#   <chr>       <dbl> <chr>     
# 1 A           12.7  25-09-2020
# 2 A           11.7  25-09-2020
# 3 C           3.67 13-01-2021

In the second line I wish the date was 19-01-2021, and not the global maximum of group A, (25-09-2020).

Any hint on how I could do that?

CodePudding user response:

I have 2 dplyr ways for you. Not happy with it because when the rollapply with max and dates doesn't find anything it in group B it uses a double by default which doesn't match the characters from group A and C.

Mutate:

test %>%
  arrange(time) %>%
  group_by(my_groups) %>% 
  mutate(final = rollapply(time, 3, max, by = 3, fill = NA, align = "left", partial = F),
         mean_3 = rollapply(measure, 3, mean, by = 3, fill = NA, align = "left", partial = F)) %>% 
  filter(!is.na(final)) %>% 
  select(my_groups, final, mean_3) %>% 
  arrange(my_groups)

# A tibble: 3 x 3
# Groups:   my_groups [2]
  my_groups final      mean_3
  <chr>     <chr>       <dbl>
1 A         19-01-2021  12.7 
2 A         25-09-2020  11.7 
3 C         13-01-2021   3.67

Summarise that doesn't summarise, but is a bit cleaner in code:

test %>%
  arrange(time) %>%
  group_by(my_groups) %>% 
  summarise(final = rollapply(time, 3, max, by = 3, fill = NA, align = "left", partial = F),
         mean_3 = rollapply(measure, 3, mean, by = 3, fill = NA, align = "left", partial = F)) %>% 
  filter(!is.na(final))

`summarise()` has grouped output by 'my_groups'. You can override using the `.groups` argument.
# A tibble: 3 x 3
# Groups:   my_groups [2]
  my_groups final      mean_3
  <chr>     <chr>       <dbl>
1 A         19-01-2021  12.7 
2 A         25-09-2020  11.7 
3 C         13-01-2021   3.67

Edit:

Added isa's solution from comment. Partial = TRUE does the trick:

test %>% 
  arrange(time) %>% 
  group_by(my_groups) %>% 
  summarise(mean_3 = rollapply(measure, 3, mean, by = 3, align = "left", partial = F), 
            final_data = rollapply(time, 3, max, by = 3, align = "left", partial = T))

`summarise()` has grouped output by 'my_groups'. You can override using the `.groups` argument.
# A tibble: 3 x 3
# Groups:   my_groups [2]
  my_groups mean_3 final_data
  <chr>      <dbl> <chr>     
1 A          12.7  19-01-2021
2 A          11.7  25-09-2020
3 C           3.67 13-01-2021

CodePudding user response:

Another possible solution:

library(tidyverse)

test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C",  "A", "A", "A"),
                  measure = c(10, 20, 5, 2, 62 ,2, 5, 4, 6, 7, 25),
                  time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021", 
                          "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"))

test %>% 
  group_by(data.table::rleid(my_groups)) %>% 
  filter(n() == 3) %>% 
  summarise(
    groups = unique(my_groups), 
    mean_3 = mean(measure), final_data = max(time), .groups = "drop") %>%
  select(-1)

#> # A tibble: 3 × 3
#>   groups mean_3 final_data
#>   <chr>   <dbl> <chr>     
#> 1 A       11.7  25-09-2020
#> 2 C        3.67 13-01-2021
#> 3 A       12.7  19-01-2021
  • Related