Home > Software engineering >  Force explicit zero values in summarise n()
Force explicit zero values in summarise n()

Time:10-27

R noob here, please bear with me!

I have a dataset in full tidy form that I'm trying to summarise. The records each have a state, year, and two different categories, which for now we'll say take on levels of A, B, C and X, Y, Z:

state    year    cat1    cat2
AK       2010    A       X
AK       2010    B       Z
AK       2010    A       Y
AK       2010    C       Z
...

(For scale, there are ~500k records)

For each state-year combination, I'm trying to get a cross-tab of how many records there are with each combination of cat1 and cat2 level. I can do this with group_by and summarise:

data %>%
    group_by(state, year, cat1, cat2) %>%
    summarise(count = n())

Which yields something like:

state    year    cat1    cat2    count
AK       2010    A       X       32
AK       2010    A       Y       501
AK       2010    A       Z       109
AK       2010    B       X       47
AK       2010    B       Y       670
AK       2010    B       Z       38
AK       2010    C       X       812
AK       2010    C       Y       17
AK       2010    C       Z       294
...

And so on for each state-year combo.

BUT the problem is that for some state-year combos, certain combinations of cat1 and cat2 don't exist, mostly b/c of small sample sizes. For instance:

state    year    cat1    cat2    count
RI       2010    A       X       10
RI       2010    A       Y       17
RI       2010    B       Y       4
RI       2010    C       X       32
RI       2010    C       Z       12
...

In these cases, I would like to force the result of group_by summarise to show all nine possible cat1 and cat2 pairs for each state-year combination, filling in zero values if there are no records of that combination. So the example above should then be:

state    year    cat1    cat2    count
RI       2010    A       X       10
RI       2010    A       Y       17
RI       2010    A       Z       0
RI       2010    B       X       0
RI       2010    B       Y       4
RI       2010    B       Z       0
RI       2010    C       X       32
RI       2010    C       Y       0
RI       2010    C       Z       12
...

How do I force these explicit zero values for the missing combinations? (I have looked some at complete but cannot figure out how to make it work here, if it's relevant.)

CodePudding user response:

You can accomplish this with

  • coercing the columns you're grouping by to factors
  • setting .drop = FALSE in the group_by statement.

Example:

library(tidyverse)

df <- tibble(
  x = factor(letters[1:5]),
  y = factor(c("x", "x", "y", "y", "z")),
  z = runif(length(x))
)

df %>% 
  group_by(x, y) %>%
  summarize(z = mean(z), .groups = "drop")
#> # A tibble: 5 × 3
#>   x     y          z
#>   <fct> <fct>  <dbl>
#> 1 a     x     0.0468
#> 2 b     x     0.392 
#> 3 c     y     0.0914
#> 4 d     y     0.793 
#> 5 e     z     0.741

df %>% 
  group_by(x, y, .drop = FALSE) %>%  # attn
  summarize(z = mean(z), .groups = "drop")
#> # A tibble: 15 × 3
#>    x     y            z
#>    <fct> <fct>    <dbl>
#>  1 a     x       0.0468
#>  2 a     y     NaN     
#>  3 a     z     NaN     
#>  4 b     x       0.392 
#>  5 b     y     NaN     
#>  6 b     z     NaN     
#>  7 c     x     NaN     
#>  8 c     y       0.0914
#>  9 c     z     NaN     
#> 10 d     x     NaN     
#> 11 d     y       0.793 
#> 12 d     z     NaN     
#> 13 e     x     NaN     
#> 14 e     y     NaN     
#> 15 e     z       0.741

Created on 2022-10-26 with reprex v2.0.2

CodePudding user response:

If you don't want to coerce to a factor, you can use complete

library(tidyverse)

data %>%
  count(state, year, cat1, cat2) %>%
  complete(expand(., cat1, cat2), fill = list(n = 0)) %>%
  fill(state, year) %>%
  select(state, year, cat1, cat2, n)
#> # A tibble: 9 x 5
#>   state  year cat1  cat2      n
#>   <chr> <int> <chr> <chr> <int>
#> 1 AK     2010 A     X         1
#> 2 AK     2010 A     Y         1
#> 3 AK     2010 A     Z         0
#> 4 AK     2010 B     X         0
#> 5 AK     2010 B     Y         0
#> 6 AK     2010 B     Z         1
#> 7 AK     2010 C     X         0
#> 8 AK     2010 C     Y         0
#> 9 AK     2010 C     Z         1

Created on 2022-10-26 with reprex v2.0.2


Data taken from question

data <- structure(list(state = c("AK", "AK", "AK", "AK"), year = c(2010L, 
2010L, 2010L, 2010L), cat1 = c("A", "B", "A", "C"), cat2 = c("X", 
"Z", "Y", "Z")), class = "data.frame", row.names = c(NA, -4L))
  • Related