summarise based on multiple groups in R dplyr-CodePudding

I have a large data frame that looks like this

library(tidyverse)

df <- tibble(id=c(1,1,2,2,2,3), counts=c(10,20,15,15,10,20), fruit=c("apple","banana","cherry","cherry","ananas","pear"))
df
#> # A tibble: 6 × 3
#>      id counts fruit 
#>   <dbl>  <dbl> <chr> 
#> 1     1     10 apple 
#> 2     1     20 banana
#> 3     2     15 cherry
#> 4     2     15 cherry
#> 5     2     10 ananas
#> 6     3     20 pear

^{Created on 2022-04-13 by the reprex package (v2.0.1)}

For each id, I want to keep the fruit with the maximum counts and then I want to add the sum_counts of unique fruits per id in another column.

I want my data to look like this:

# A tibble: 3 × 4
     id central_fruit fruits         sum_counts
  <dbl> <chr>         <chr>               <dbl>
1     1 banana        banana, apple          30
2     2 cherry        cherry, ananas         30
3     3 pear          pear                   20

This is what I have tried so far and I do not know why I fail miserably

library(tidyverse)

df <- tibble(id=c(1,1,2,2,2,3), counts=c(10,20,15,15,15,20), fruit=c("apple","banana","cherry","cherry","ananas","pear"))

df %>% 
  group_by(id,fruit) %>% 
  add_count(fruit) %>% 
  ungroup() %>% 
  group_by(id) %>% 
  summarise(central_fruit=fruit[which.max(counts)],
            fruits = toString(sort(unique(fruit), decreasing = TRUE)),
            sum_counts = sum(unique(counts)))
#> # A tibble: 3 × 4
#>      id central_fruit fruits         sum_counts
#>   <dbl> <chr>         <chr>               <dbl>
#> 1     1 banana        banana, apple          30
#> 2     2 cherry        cherry, ananas         15
#> 3     3 pear          pear                   20

^{Created on 2022-04-13 by the reprex package (v2.0.1)}

CodePudding user response：

Here's a dplyr approach.

library(dplyr)

df <- tibble(id=c(1,1,2,2,2,3), counts=c(10,20,15,15,10,20), fruit=c("apple","banana","cherry","cherry","ananas","pear"))

df %>% 
  group_by(id) %>% 
  mutate(fruits = paste0(unique(fruit), collapse = ", "),
         sum_counts = sum(unique(counts))) %>% 
  filter(counts == max(counts)) %>% 
  distinct() %>% 
  rename("central_fruit" = "fruit") %>% 
  select(-counts)
#> # A tibble: 3 × 4
#> # Groups:   id [3]
#>      id central_fruit fruits         sum_counts
#>   <dbl> <chr>         <chr>               <dbl>
#> 1     1 banana        apple, banana          30
#> 2     2 cherry        cherry, ananas         25
#> 3     3 pear          pear                   20

^{Created on 2022-04-13 by the reprex package (v2.0.1)}

CodePudding user response：

This should work:

df |>
    group_by(id) |>
    distinct(fruit, .keep_all = TRUE) |>
    mutate(
        is_central_fruit = counts == max(counts),
        sum_counts = sum(counts),
        fruits = paste(fruit, collapse = ", ")
    ) |>
    filter(
        is_central_fruit
    )   |>
    select(
        -is_central_fruit, 
        -counts,
        central_fruit = fruit

    )

#      id central_fruit sum_counts fruits
#   <dbl> <chr>              <dbl> <chr>
# 1     1 banana                30 apple, banana
# 2     2 cherry                25 cherry, ananas
# 3     3 pear                  20 pear

If you want to order the fruits column then I wouldn't store fruits as a character vector, but as a list of factors.

CodePudding user response：

And another dplyr approach but preserving the fruits order (central_fruit is first in fruits):

df %>% 
  distinct() %>% 
  group_by(id) %>% 
  mutate(sum_counts = sum(counts)) %>% 
  arrange(id, desc(counts)) %>% 
  mutate(fruits = paste(fruit, collapse = ", ")) %>% 
  slice(1) %>% 
  select(id, central_fruit = fruit, fruits, sum_counts) %>% 
  ungroup()

This returns

# A tibble: 3 x 4
     id central_fruit fruits         sum_counts
  <dbl> <chr>        <chr>               <dbl>
1     1 banana       banana, apple          30
2     2 cherry       cherry, ananas         25
3     3 pear         pear                   20