Home > database >  How to aggregate a data frame based on the max value of the group in R
How to aggregate a data frame based on the max value of the group in R

Time:03-17

I have a large data with many groups that looks like this. I want in each group to use the fruit with the most counts as the central fruit, and aggregate the other fruits based on it!

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
             col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             counts_col1 = c(100,100,2,200,200,2),
             counts_col2 = c(2,50,50,2,20,20),
             id=c(1,1,1,2,2,2))

df
#> # A tibble: 6 × 5
#>   col1    col2    counts_col1 counts_col2    id
#>   <chr>   <chr>         <dbl>       <dbl> <dbl>
#> 1 apple   pple            100           2     1
#> 2 apple   app             100          50     1
#> 3 pple    app               2          50     1
#> 4 banana  bananna         200           2     2
#> 5 banana  banan           200          20     2
#> 6 bananna banan             2          20     2

Created on 2022-03-16 by the reprex package (v2.0.1)

I want my data frame to look like this

id  central_fruit   fruits                 counts     sum_counts
 1     apple        apple,pple,app         100,50,2        152
 2    banana        banana,bananna,banan   200,20,2        222

The format of the output it does not have to be like this. This is just an example. It can be a list of characters or just characters. Any help or guidance is appreciated

CodePudding user response:

We may do this by first reshaping to 'long' format (pivot_longer), grouped by 'id', 'grp', create a frequency count (add_count), then summarise the 'central_fruit' which had the max frequency by 'id', and similarly paste (toString) the unique fruit, and unique count along with sum of unique count

library(dplyr)
library(stringr)
library(tidyr)
df %>%
   rename_with(~ str_c("fruit_", .x), starts_with('col')) %>% 
   pivot_longer(cols = -id, names_to = c(".value", "grp"), 
     names_pattern = "(.*)_(col\\d )") %>% 
   group_by(id, grp) %>%
   add_count(fruit) %>%
   group_by(id) %>% 
   summarise(central_fruit = fruit[which.max(n)], 
      fruits = toString(unique(fruit)), 
      sum_counts = sum(unique(counts)),
      counts = toString(sort(unique(counts), decreasing = TRUE)),
        .groups = 'drop' ) %>%
     relocate(counts, .before = 'sum_counts')

-output

# A tibble: 2 × 5
     id central_fruit fruits                 counts     sum_counts
  <dbl> <chr>         <chr>                  <chr>           <dbl>
1     1 apple         apple, pple, app       100, 50, 2        152
2     2 banana        banana, bananna, banan 200, 20, 2        222

NOTE: It may be better to wrap the values of 'counts' in a list instead of pasteing. i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be counts = list(sort(unique(counts), decreasing = TRUE))

CodePudding user response:

Using data.table, you could do:

Reprex

  • Code
library(tidyverse) # to read your tibble
library(data.table)

setDT(df)[, .(central_fruit = col1[which.max(counts_col1)],
              fruits = .(unique(c(col1, col2))),
              counts = .(sort(unique(c(counts_col1, counts_col2)), decreasing = TRUE)),
              sum_counts = unlist(lapply(.(unique(c(counts_col1, counts_col2))), sum))), 
          by = id]
  • Output
#>       id central_fruit               fruits      counts sum_counts
#>    <num>        <char>               <list>      <list>      <num>
#> 1:     1         apple       apple,pple,app 100, 50,  2        152
#> 2:     2        banana banana,bananna,banan 200, 20,  2        222

Created on 2022-03-16 by the reprex package (v2.0.1)

  • Related