How do I summarize only the rows that match certain criteria

I'm learning R and this is my first question on Stack Overflow. I'd appreciate it if someone could help me.

I'm trying to summarize only certain rows based on the values of a column. For example, I want to sum the values of groups "A" and "B" for each year into a new group called "AB". I'm doing all my data manipulation in dplyr, but couldn't think of a way to do this.

df <- data.frame (year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022),
                  group = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
                  value = c(10, 5, 8, 6, 12, 20, 17, 3, 6, 15, 12, 5)
                  )

   year group value
1  2020     A    10
2  2020     B     5
3  2020     C     8
4  2020     D     6
5  2021     A    12
6  2021     B    20
7  2021     C    17
8  2021     D     3
9  2022     A     6
10 2022     B    15
11 2022     C    12
12 2022     D     5

I want to do something like this.

  year group value
1 2020    AB    15
2 2020     C     8
3 2020     D     6
4 2021    AB    32
5 2021     C    17
6 2021     D     3
7 2022    AB    21
8 2022     C    12
9 2022     D     5

Thank you.

CodePudding user response：

We can replace 'A' and 'B' with 'AB' inside group_by() and then do a grouped sum:

library(dplyr)
df %>%
   group_by(year, group = replace(group, group %in% c("A", "B"), "AB")) %>% 
   summarise(value = sum(value, na.rm = TRUE), .groups = 'drop')

Output:

# A tibble: 9 × 3
   year group value
  <dbl> <chr> <dbl>
1  2020 AB       15
2  2020 C         8
3  2020 D         6
4  2021 AB       32
5  2021 C        17
6  2021 D         3
7  2022 AB       21
8  2022 C        12
9  2022 D         5
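
To see what the relabelling step does on its own, here is a minimal base R sketch on a short, made-up vector (the names `group` and `relabelled` are only for illustration). `group %in% c("A", "B")` builds a logical mask, and `replace()` overwrites just the masked positions:

```r
# A short example vector standing in for the group column
group <- c("A", "B", "C", "D")

# Logical mask: TRUE where the value is "A" or "B"
mask <- group %in% c("A", "B")
mask
# [1]  TRUE  TRUE FALSE FALSE

# replace() overwrites only the masked positions and leaves the rest alone
relabelled <- replace(group, mask, "AB")
relabelled
# [1] "AB" "AB" "C"  "D"
```

Because "A" and "B" now share one label, group_by() treats them as a single group and sum() adds their values together.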

CodePudding user response：

The best approach to doing this will depend on exactly what kind of summary statistics you're looking for. In your example you're just summing the value, which is easy: we can just change all those "A"s and "B"s into "AB"s and then summarize:

library(dplyr)
df <- data.frame (year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022),
                  group = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
                  value = c(10, 5, 8, 6, 12, 20, 17, 3, 6, 15, 12, 5)
                  )
df %>%
 mutate(group = case_when(group %in% c('A','B') ~ 'AB', 
                          TRUE ~ group)) %>%
 group_by(year, group) %>%
 summarize(value = sum(value), .groups = 'drop')

However, this relabel-first approach won't work for every statistic. For example, perhaps you want the average of all the As and the average of all the Bs, and then the average of those two averages for AB. In that case you might be better off splitting things up and row-binding them back together:

orig_groups <- df %>%
  filter(!(group %in% c('A','B'))) %>% # get the groups that don't need anything special
  group_by(year, group) %>%
  summarize(value = mean(value), .groups = 'drop')

df %>%
  filter(group %in% c('A','B')) %>% # get the groups to combine
  group_by(year, group) %>%
  summarize(value = mean(value), .groups = 'drop') %>% # summarize each group separately
  mutate(group = 'AB') %>% # only A and B remain here, so relabel them all
  group_by(year, group) %>%
  summarize(value = mean(value), .groups = 'drop') %>% # summarize again across the combined group
  bind_rows(orig_groups) %>% # bring back the groups that didn't need this
  arrange(year, group) # restore the year/group ordering
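
With the data in the question every group has exactly one row per year, so both approaches give the same answer. The difference only shows up when the groups have unequal row counts. Here is a small sketch with made-up data (the `toy` data frame and its values are hypothetical, not from the question) comparing the two:

```r
library(dplyr)

# Hypothetical data: group A has three rows, group B has only one
toy <- data.frame(
  year  = rep(2020, 4),
  group = c("A", "A", "A", "B"),
  value = c(10, 20, 30, 100)
)

# Pooled mean: relabel first, then average all four rows together
pooled <- toy %>%
  mutate(group = "AB") %>%
  group_by(year, group) %>%
  summarise(value = mean(value), .groups = "drop")

# Mean of means: average within A and within B, then average those two results
mean_of_means <- toy %>%
  group_by(year, group) %>%
  summarise(value = mean(value), .groups = "drop") %>%
  mutate(group = "AB") %>%
  group_by(year, group) %>%
  summarise(value = mean(value), .groups = "drop")

pooled$value         # 40, i.e. (10 + 20 + 30 + 100) / 4
mean_of_means$value  # 60, i.e. (20 + 100) / 2
```

The pooled version weights every row equally, while the mean of means weights A and B equally regardless of how many rows each has, so which one you want depends on the question you're asking.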