I have a large data with many groups that looks like this. I want in each group to use the fruit with the most counts as the central fruit, and aggregate the other fruits based on it!
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20),
id=c(1,1,1,2,2,2))
df
#> # A tibble: 6 × 5
#> col1 col2 counts_col1 counts_col2 id
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 100 2 1
#> 2 apple app 100 50 1
#> 3 pple app 2 50 1
#> 4 banana bananna 200 2 2
#> 5 banana banan 200 20 2
#> 6 bananna banan 2 20 2
Created on 2022-03-16 by the reprex package (v2.0.1)
I want my data frame to look like this
id central_fruit fruits counts sum_counts
1 apple apple,pple,app 100,50,2 152
2 banana banana,bananna,banan 200,20,2 222
The format of the output it does not have to be like this. This is just an example. It can be a list of characters or just characters. Any help or guidance is appreciated
CodePudding user response:
We may do this by first reshaping to 'long' format (pivot_longer
), grouped by 'id', 'grp', create a frequency count (add_count
), then summarise
the 'central_fruit' which had the max
frequency by 'id', and similarly paste
(toString
) the unique
fruit, and unique
count along with sum
of unique
count
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rename_with(~ str_c("fruit_", .x), starts_with('col')) %>%
pivot_longer(cols = -id, names_to = c(".value", "grp"),
names_pattern = "(.*)_(col\\d )") %>%
group_by(id, grp) %>%
add_count(fruit) %>%
group_by(id) %>%
summarise(central_fruit = fruit[which.max(n)],
fruits = toString(unique(fruit)),
sum_counts = sum(unique(counts)),
counts = toString(sort(unique(counts), decreasing = TRUE)),
.groups = 'drop' ) %>%
relocate(counts, .before = 'sum_counts')
-output
# A tibble: 2 × 5
id central_fruit fruits counts sum_counts
<dbl> <chr> <chr> <chr> <dbl>
1 1 apple apple, pple, app 100, 50, 2 152
2 2 banana banana, bananna, banan 200, 20, 2 222
NOTE: It may be better to wrap the values of 'counts' in a list
instead of paste
ing. i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE))
, it would be
counts = list(sort(unique(counts), decreasing = TRUE))
CodePudding user response:
Using data.table
, you could do:
Reprex
- Code
library(tidyverse) # to read your tibble
library(data.table)
setDT(df)[, .(central_fruit = col1[which.max(counts_col1)],
fruits = .(unique(c(col1, col2))),
counts = .(sort(unique(c(counts_col1, counts_col2)), decreasing = TRUE)),
sum_counts = unlist(lapply(.(unique(c(counts_col1, counts_col2))), sum))),
by = id]
- Output
#> id central_fruit fruits counts sum_counts
#> <num> <char> <list> <list> <num>
#> 1: 1 apple apple,pple,app 100, 50, 2 152
#> 2: 2 banana banana,bananna,banan 200, 20, 2 222
Created on 2022-03-16 by the reprex package (v2.0.1)