In R, what's the easiest way to get counts by group and overall counts in the same output?

I'm trying to count the number of students of each gender by class, and I also want the number of students of each gender overall. The desired output is a single object containing both the overall and the by-class gender breakdowns.

I have working code (below), but I'd like to know whether there is a more streamlined way to do this without creating an intermediate object and joining the two together.

library(dplyr)
# Sample dataset
test_data <- tibble(id = c(1, 1, 2, 2, 2, 3, 3, 3),
                    class = c("h", "h", "m", "h", "s", "m", "h", "h"),
                    gender = c("m", "m", "f", "f", "f", "m", "m", "m"))

# My current code (produces the desired output, but I'm curious whether there's a more efficient method)
gender_by_class <- test_data %>%
  distinct(id, class, gender) %>%
  group_by(class) %>%
  count(gender) %>%
  ungroup()

gender_overall <- test_data %>%
  distinct(id, gender) %>%
  count(gender) %>%
  mutate(class = "overall") %>%
  full_join(gender_by_class)

CodePudding user response：

Similar to @Quinten's approach, but with n_distinct:

library(dplyr)

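test_data %>%
  group_by(gender) %>%
  # n_distinct(id) counts each student once, so no distinct() step is needed
  summarise(n = n_distinct(id), class = 'overall') %>%
  bind_rows(
    test_data %>%
      group_by(class, gender) %>%
      # .groups = 'drop' returns an ungrouped result and silences the regrouping message
      summarise(n = n_distinct(id), .groups = 'drop')
  )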

Output:

# A tibble: 7 × 3
  gender     n class  
  <chr>  <int> <chr>  
1 f          1 overall
2 m          2 overall
3 f          1 h      
4 m          2 h      
5 f          1 m      
6 m          1 m      
7 f          1 s

CodePudding user response：

You could use bind_rows to do it all in a single pipe, like this:

library(dplyr)

test_data %>%
  distinct(id, class, gender) %>%
  group_by(class) %>%
  count(gender) %>%
  ungroup() %>%
  bind_rows(., test_data %>%
              distinct(id, gender) %>%
              count(gender) %>% 
              mutate(class = "overall")) 
#> # A tibble: 7 × 3
#>   class   gender     n
#>   <chr>   <chr>  <int>
#> 1 h       f          1
#> 2 h       m          2
#> 3 m       f          1
#> 4 m       m          1
#> 5 s       f          1
#> 6 overall f          1
#> 7 overall m          2

Created on 2023-01-29 with reprex v2.0.2

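Thanks to @stefan, an even more concise option: count() accepts the grouping variables directly, and it can create the constant class = "overall" column in the same call: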

library(dplyr)

test_data %>%
  distinct(id, class, gender) %>%
  count(class, gender) %>%
  bind_rows(., test_data %>%
              distinct(id, gender) %>%
              count(class = "overall", gender))
#> # A tibble: 7 × 3
#>   class   gender     n
#>   <chr>   <chr>  <int>
#> 1 h       f          1
#> 2 h       m          2
#> 3 m       f          1
#> 4 m       m          1
#> 5 s       f          1
#> 6 overall f          1
#> 7 overall m          2

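
To check that the distinct() + count() pipeline and the n_distinct() pipeline from the other answer agree, they can be compared on the sample data. This is a minimal sketch that assumes the test_data from the question; the names by_count and by_n_distinct are made up for illustration, and the n_distinct() version is reordered so the by-class rows come first.

```r
library(dplyr)

# Sample data copied from the question
test_data <- tibble(id = c(1, 1, 2, 2, 2, 3, 3, 3),
                    class = c("h", "h", "m", "h", "s", "m", "h", "h"),
                    gender = c("m", "m", "f", "f", "f", "m", "m", "m"))

# Approach A: de-duplicate students first, then count rows
by_count <- test_data %>%
  distinct(id, class, gender) %>%
  count(class, gender) %>%
  bind_rows(test_data %>%
              distinct(id, gender) %>%
              count(class = "overall", gender))

# Approach B: count unique ids per group directly
by_n_distinct <- test_data %>%
  group_by(class, gender) %>%
  summarise(n = n_distinct(id), .groups = "drop") %>%
  bind_rows(test_data %>%
              group_by(gender) %>%
              summarise(n = n_distinct(id), class = "overall"))

# Compare as plain data frames so only columns and values are checked
all.equal(as.data.frame(by_count), as.data.frame(by_n_distinct))
```

Both approaches count each id once per group. The choice is mostly about style: n_distinct() skips the distinct() step, while count() keeps the pipeline shorter.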