Home > Mobile >  aggregate rows with condition in R
aggregate rows with condition in R

Time:09-16

My example

df <- data.frame(id1 = c("a" , "b", "c"),
                 id2 = c("a", "a", "d"),
                 n1 = c(2,2,0),
                 n2 = c(2,1,1),
                 n3 = c(0,1,1),
                 n4 = c(0,1,1))

First, I already aggregated all rows across column like this

df <- df %>%
group_by(id2) %>% 
summarise(across(c(n1,n2,n3,n4), sum, na.rm = TRUE), 
            .groups = "drop")

Now, but now I would like to aggregate only 2 first rows having a in column id2. How we keep the column id1 since my desire output like this. Honestly, that column is just used to compare to id2 and is quite redundant, but I really want to keep it.

id1 id2 n1 n2 n3 n4 
a   a   4  3  1  1
c   d   0  1  1  1

Any suggestions for this?

CodePudding user response:

Change the id2 values where it has 'a' in it.

library(dplyr)

df %>%
  group_by(id1 = ifelse(id2 == 'a', id2, id1), id2) %>% 
  summarise(across(starts_with('n'), sum, na.rm = TRUE), .groups = "drop")

#  id1   id2      n1    n2    n3    n4
#  <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a     a         4     3     1     1
#2 c     d         0     1     1     1

CodePudding user response:

Other solution would be using case_when. This function is more readable if you need to use multiple casuistic sentences:

library(dplyr)
   df %>%
      mutate(id1 = case_when(
        id2 == 'a' ~ id2,
        TRUE ~ id1
      )) %>% 
      group_by(id1, id2) %>% 
      summarise(across(starts_with('n'), sum, na.rm = TRUE), 
                .groups = "drop")

which yields:

## A tibble: 2 x 6
#  id1   id2      n1    n2    n3    n4
#  <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a     a         4     3     1     1
#2 c     d         0     1     1     1

Note: The summarise part was copied from @Ronak Shah's answer

  • Related