I'm learning R, and this is my first question on Stack Overflow. I'd appreciate it if someone could help me.
I'm trying to summarize only some of the rows, based on the values in a column. For example, I want to sum the values of groups "A" and "B" for each year into a new group called "AB". I'm doing all my data manipulation in dplyr, but I couldn't think of a way to do this.
df <- data.frame (year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022),
group = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
value = c(10, 5, 8, 6, 12, 20, 17, 3, 6, 15, 12, 5)
)
year group value
1 2020 A 10
2 2020 B 5
3 2020 C 8
4 2020 D 6
5 2021 A 12
6 2021 B 20
7 2021 C 17
8 2021 D 3
9 2022 A 6
10 2022 B 15
11 2022 C 12
12 2022 D 5
I want to do something like this.
year group value
1 2020 AB 15
2 2020 C 8
3 2020 D 6
4 2021 AB 32
5 2021 C 17
6 2021 D 3
7 2022 AB 21
8 2022 C 12
9 2022 D 5
Thank you.
CodePudding user response:
We can replace 'A' and 'B' with 'AB' in the group column and then do a grouped sum:
library(dplyr)
df %>%
group_by(year, group = replace(group, group %in% c("A", "B"), "AB")) %>%
summarise(value = sum(value, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 9 × 3
year group value
<dbl> <chr> <dbl>
1 2020 AB 15
2 2020 C 8
3 2020 D 6
4 2021 AB 32
5 2021 C 17
6 2021 D 3
7 2022 AB 21
8 2022 C 12
9 2022 D 5
CodePudding user response:
The best approach to doing this will depend on exactly what kind of summary statistics you're looking for. In your example you're just summing the value, which is easy: we can just change all those "A"s and "B"s into "AB"s and then summarize:
library(dplyr)
df <- data.frame (year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022),
group = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
value = c(10, 5, 8, 6, 12, 20, 17, 3, 6, 15, 12, 5)
)
df %>%
mutate(group = case_when(group %in% c('A','B') ~ 'AB',
TRUE ~ group)) %>%
group_by(year, group) %>%
summarize(value = sum(value), .groups = 'drop') # drop grouping so the result is a plain tibble
However, there are other statistics you might ask for where this won't work. For example, perhaps you want the average of all the As and the average of all the Bs, and then want to report the average of those two averages for AB. In that case you might be better off splitting things up and row-binding them back together:
orig_groups <- df %>%
filter(!(group %in% c('A','B'))) %>% # get the groups that don't need anything special
group_by(year, group) %>%
summarize(value = mean(value))
df %>%
filter(group %in% c('A','B')) %>% # Get the groups to combine
group_by(year, group) %>%
summarize(value = mean(value)) %>% # Summarize as desired
mutate(group = case_when(group %in% c('A','B') ~ 'AB',
TRUE ~ group)) %>% # Combine them
group_by(year, group) %>%
summarize(value = mean(value)) %>% # summarize as desired again
bind_rows(orig_groups) # and bring back in the stuff that didn't need this
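Note that in the question's data every group has exactly one row per year, so both approaches return the same numbers there. The distinction only shows up when the groups being combined have different numbers of rows. Here is a minimal sketch of that, using a made-up data frame (`toy` and its values are hypothetical, not from the question):

```r
library(dplyr)

# Hypothetical data: group A has three rows, group B has only one
toy <- data.frame(
  group = c("A", "A", "A", "B"),
  value = c(1, 2, 3, 10)
)

# Pooled mean: relabel A and B as AB first, then average all four rows together
pooled <- toy %>%
  mutate(group = if_else(group %in% c("A", "B"), "AB", group)) %>%
  group_by(group) %>%
  summarize(value = mean(value)) %>%
  pull(value)

# Mean of means: average within A and within B, then average those two results
mean_of_means <- toy %>%
  group_by(group) %>%
  summarize(value = mean(value)) %>%
  summarize(value = mean(value)) %>%
  pull(value)

pooled         # (1 + 2 + 3 + 10) / 4 = 4
mean_of_means  # (2 + 10) / 2 = 6
```

If the pooled mean is what you want, the simpler relabel-then-summarize approach is enough; the split-and-bind version is only needed for the mean-of-means case.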