Counting distinct occurences in a group by function with different summarise filters-CodePudding

I'm trying to get the number of distinct observations for a variable, and the number of distinct observations for the same variable based on a condition in another variable. All this in a single group_by and summarise operation. Suppose this is my df:

library(tidyverse)


set.seed(1)


sample_size = 20

df = tibble(id = 1:sample_size,
            var_1 = sample(c(1:5), sample_size, replace = TRUE),
            var_2 = sample(c('yes','no','maybe'), sample_size, replace = TRUE),
            month = sample(c(1:12), sample_size, replace = TRUE),
            year = sample(c(2022,2023), sample_size, replace = TRUE)
)

# A tibble: 6 × 5
     id var_1 var_2 month  year
  <int> <int> <chr> <int> <dbl>
1     1     1 yes       8  2023
2     2     4 no        6  2022
3     3     1 no       10  2022
4     4     2 no        7  2022
5     5     5 yes       3  2023
6     6     3 maybe    10  2022

Here's the distinct occurences for each var_1 value:

df %>% 
  group_by(month, year) %>% 
  summarise(distinct_var_1 = var_1 %>% n_distinct()) %>% 
  ungroup()

# A tibble: 6 × 3
  month  year distinct_var_1
  <int> <dbl>          <int>
1     1  2023              1
2     2  2022              1
3     2  2023              1
4     3  2022              2
5     3  2023              1
6     6  2022              2

And the distinct occurences for each var_1 value given that var_2 = 'maybe':

df %>% 
  filter(var_2 == 'maybe') %>% 
  group_by(month, year) %>% 
  summarise(distinct_2 = var_1 %>% n_distinct()) %>%
  ungroup()

# A tibble: 4 × 3
  month  year distinct_2
  <int> <dbl>      <int>
1     1  2023          1
2     6  2023          1
3     8  2022          2
4    10  2022          1

I would like to be able to do both those operations in a single group_by summarise. Any ideas?

CodePudding user response：

You can try

df %>% 
  group_by(month, year) %>% 
  summarise(distinct_var_1 = n_distinct(var_1),
            distinct_2 = n_distinct(var_1[var_2 == 'maybe'])) %>% 
  ungroup()

# # A tibble: 13 × 4
#    month  year distinct_var_1 distinct_2
#    <int> <dbl>          <int>      <int>
#  1     1  2023              1          1
#  2     2  2022              1          0
#  3     2  2023              1          0
#  4     3  2022              2          0
#  5     3  2023              1          0
#  6     6  2022              2          0
#  7     6  2023              2          1
#  8     7  2022              1          0
#  9     7  2023              1          0
# 10     8  2022              2          2
# 11     8  2023              1          0
# 12    10  2022              2          1
# 13    12  2022              1          0

CodePudding user response：

You could extract from var_1 the values which var_2 equal 'maybe' inside the function

df %>% 
  group_by(month, year) %>% 
  summarise(distinct_var_1 = n_distinct(var_1 ), distinct_var_2 = n_distinct(var_1[var_2 == 'maybe'])) %>% 
  ungroup()

Returns:

    # A tibble: 13 × 4
   month  year distinct_var_1 distinct_var_2
   <int> <dbl>          <int>          <int>
 1     1  2023              1              1
 2     2  2022              1              0
 3     2  2023              1              0
 4     3  2022              2              0
 5     3  2023              1              0
 6     6  2022              2              0
 7     6  2023              2              1
 8     7  2022              1              0
 9     7  2023              1              0
10     8  2022              2              2
11     8  2023              1              0
12    10  2022              2              1
13    12  2022              1              0