Conditional statements within groups in dplyr

Time:03-07

Using dplyr, I would like to summarise groups within a dataset using a conditional statement, where the presence of two conditions within a group triggers a TRUE value and all other permutations trigger a FALSE. It's best illustrated with an example. Say we have a dataset with several observations of a categorical variable within each id number:

df <- data.frame(id = factor(c(1, 2, 2, 3, 3, 4, 4)),
                 l = factor(c("a", "a", "b", "a", "c", "b", "d")))

df

#   id l
# 1  1 a
# 2  2 a
# 3  2 b
# 4  3 a
# 5  3 c
# 6  4 b
# 7  4 d

Now say I want a TRUE to occur only when an id group has BOTH a AND c within it.

I can create a conditional that returns TRUE if the id group has a OR c, using the base R any() function inside summarise():

library(dplyr)

df %>%
  group_by(id) %>%
  summarise(ab = any(l %in% c("a", "c")))

#   id    ab   
#   <fct> <lgl>
# 1 1     TRUE 
# 2 2     TRUE 
# 3 3     TRUE 
# 4 4     FALSE

The documentation for any() also covers all(), which returns TRUE only when every value is TRUE, so I tried that instead:

library(dplyr)

df %>%
  group_by(id) %>%
  summarise(ab = all(l %in% c("a", "c")))

#   id    ab   
#   <fct> <lgl>
# 1 1     TRUE 
# 2 2     FALSE
# 3 3     TRUE 
# 4 4     FALSE

This is close but not quite right: id number 1 has only one observation, so it cannot contain both a and c, yet it returns TRUE.

Can anyone suggest a solution?

Answer:

Reverse the operands of %in%.

You want to know whether all of c("a", "c") are in the group, not whether all of the group's values are in c("a", "c").

df %>%
  group_by(id) %>%
  summarise(ab = all(c("a", "c") %in% l))
#> # A tibble: 4 x 2
#>   id    ab   
#>   <fct> <lgl>
#> 1 1     FALSE
#> 2 2     FALSE
#> 3 3     TRUE 
#> 4 4     FALSE
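
To see why the direction of %in% matters, it may help to look at single groups outside of dplyr. The following is a minimal base R sketch, not part of the original answer; the names g1, g3 and targets are made up for illustration and hold the l values of id 1 and id 3 from the example data.

```r
# Hypothetical helper vectors: the l values for id 1 and id 3
g1 <- c("a")
g3 <- c("a", "c")
targets <- c("a", "c")

# x %in% y returns one logical per element of x (the left-hand side).

# Group on the left: "is every value in the group one of the targets?"
group_in_targets <- all(g1 %in% targets)   # TRUE, even though "c" is absent

# Targets on the left: "is every target present in the group?"
targets_in_g1 <- all(targets %in% g1)      # FALSE, "c" is missing from id 1
targets_in_g3 <- all(targets %in% g3)      # TRUE, id 3 has both "a" and "c"
```

With the targets on the left, the result always has one entry per target, so a group with a single observation can never satisfy both conditions.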