I'm trying to get the number of distinct observations for a variable, and the number of distinct observations for the same variable based on a condition in another variable. All this in a single group_by
and summarise
operation. Suppose this is my df:
library(tidyverse)
set.seed(1)
sample_size = 20
df = tibble(id = 1:sample_size,
var_1 = sample(c(1:5), sample_size, replace = TRUE),
var_2 = sample(c('yes','no','maybe'), sample_size, replace = TRUE),
month = sample(c(1:12), sample_size, replace = TRUE),
year = sample(c(2022,2023), sample_size, replace = TRUE)
)
# A tibble: 6 × 5
id var_1 var_2 month year
<int> <int> <chr> <int> <dbl>
1 1 1 yes 8 2023
2 2 4 no 6 2022
3 3 1 no 10 2022
4 4 2 no 7 2022
5 5 5 yes 3 2023
6 6 3 maybe 10 2022
Here's the distinct occurences for each var_1 value:
df %>%
group_by(month, year) %>%
summarise(distinct_var_1 = var_1 %>% n_distinct()) %>%
ungroup()
# A tibble: 6 × 3
month year distinct_var_1
<int> <dbl> <int>
1 1 2023 1
2 2 2022 1
3 2 2023 1
4 3 2022 2
5 3 2023 1
6 6 2022 2
And the distinct occurences for each var_1 value given that var_2 = 'maybe':
df %>%
filter(var_2 == 'maybe') %>%
group_by(month, year) %>%
summarise(distinct_2 = var_1 %>% n_distinct()) %>%
ungroup()
# A tibble: 4 × 3
month year distinct_2
<int> <dbl> <int>
1 1 2023 1
2 6 2023 1
3 8 2022 2
4 10 2022 1
I would like to be able to do both those operations in a single group_by summarise. Any ideas?
CodePudding user response:
You can try
df %>%
group_by(month, year) %>%
summarise(distinct_var_1 = n_distinct(var_1),
distinct_2 = n_distinct(var_1[var_2 == 'maybe'])) %>%
ungroup()
# # A tibble: 13 × 4
# month year distinct_var_1 distinct_2
# <int> <dbl> <int> <int>
# 1 1 2023 1 1
# 2 2 2022 1 0
# 3 2 2023 1 0
# 4 3 2022 2 0
# 5 3 2023 1 0
# 6 6 2022 2 0
# 7 6 2023 2 1
# 8 7 2022 1 0
# 9 7 2023 1 0
# 10 8 2022 2 2
# 11 8 2023 1 0
# 12 10 2022 2 1
# 13 12 2022 1 0
CodePudding user response:
You could extract from var_1 the values which var_2 equal 'maybe' inside the function
df %>%
group_by(month, year) %>%
summarise(distinct_var_1 = n_distinct(var_1 ), distinct_var_2 = n_distinct(var_1[var_2 == 'maybe'])) %>%
ungroup()
Returns:
# A tibble: 13 × 4
month year distinct_var_1 distinct_var_2
<int> <dbl> <int> <int>
1 1 2023 1 1
2 2 2022 1 0
3 2 2023 1 0
4 3 2022 2 0
5 3 2023 1 0
6 6 2022 2 0
7 6 2023 2 1
8 7 2022 1 0
9 7 2023 1 0
10 8 2022 2 2
11 8 2023 1 0
12 10 2022 2 1
13 12 2022 1 0