R noob here, please bear with me!
I have a dataset in full tidy form that I'm trying to summarise
. The records each have a state
, year
, and two different categories, which for now we'll say take on levels of A, B, C
and X, Y, Z
:
state year cat1 cat2
AK 2010 A X
AK 2010 B Z
AK 2010 A Y
AK 2010 C Z
...
(For scale, there are ~500k records)
For each state-year combination, I'm trying to get a cross-tab of how many records there are with each combination of cat1
and cat2
level. I can do this with group_by
and summarise
:
data %>%
group_by(state, year, cat1, cat2) %>%
summarise(count = n())
Which yields something like:
state year cat1 cat2 count
AK 2010 A X 32
AK 2010 A Y 501
AK 2010 A Z 109
AK 2010 B X 47
AK 2010 B Y 670
AK 2010 B Z 38
AK 2010 C X 812
AK 2010 C Y 17
AK 2010 C Z 294
...
And so on for each state-year combo.
BUT the problem is that for some state-year combos, certain combinations of cat1
and cat2
don't exist, mostly b/c of small sample sizes. For instance:
state year cat1 cat2 count
RI 2010 A X 10
RI 2010 A Y 17
RI 2010 B Y 4
RI 2010 C X 32
RI 2010 C Z 12
...
In these cases, I would like to force the result of group_by
summarise
to show all nine possible cat1
and cat2
pairs for each state-year combination, filling in zero values if there are no records of that combination. So the example above should then be:
state year cat1 cat2 count
RI 2010 A X 10
RI 2010 A Y 17
RI 2010 A Z 0
RI 2010 B X 0
RI 2010 B Y 4
RI 2010 B Z 0
RI 2010 C X 32
RI 2010 C Y 0
RI 2010 C Z 12
...
How do I force these explicit zero values for the missing combinations?
(I have looked some at complete
but cannot figure out how to make it work here, if it's relevant.)
CodePudding user response:
You can accomplish this with
- coercing the columns you're grouping by to factors
- setting
.drop = FALSE
in thegroup_by
statement.
Example:
library(tidyverse)
df <- tibble(
x = factor(letters[1:5]),
y = factor(c("x", "x", "y", "y", "z")),
z = runif(length(x))
)
df %>%
group_by(x, y) %>%
summarize(z = mean(z), .groups = "drop")
#> # A tibble: 5 × 3
#> x y z
#> <fct> <fct> <dbl>
#> 1 a x 0.0468
#> 2 b x 0.392
#> 3 c y 0.0914
#> 4 d y 0.793
#> 5 e z 0.741
df %>%
group_by(x, y, .drop = FALSE) %>% # attn
summarize(z = mean(z), .groups = "drop")
#> # A tibble: 15 × 3
#> x y z
#> <fct> <fct> <dbl>
#> 1 a x 0.0468
#> 2 a y NaN
#> 3 a z NaN
#> 4 b x 0.392
#> 5 b y NaN
#> 6 b z NaN
#> 7 c x NaN
#> 8 c y 0.0914
#> 9 c z NaN
#> 10 d x NaN
#> 11 d y 0.793
#> 12 d z NaN
#> 13 e x NaN
#> 14 e y NaN
#> 15 e z 0.741
Created on 2022-10-26 with reprex v2.0.2
CodePudding user response:
If you don't want to coerce to a factor, you can use complete
library(tidyverse)
data %>%
count(state, year, cat1, cat2) %>%
complete(expand(., cat1, cat2), fill = list(n = 0)) %>%
fill(state, year) %>%
select(state, year, cat1, cat2, n)
#> # A tibble: 9 x 5
#> state year cat1 cat2 n
#> <chr> <int> <chr> <chr> <int>
#> 1 AK 2010 A X 1
#> 2 AK 2010 A Y 1
#> 3 AK 2010 A Z 0
#> 4 AK 2010 B X 0
#> 5 AK 2010 B Y 0
#> 6 AK 2010 B Z 1
#> 7 AK 2010 C X 0
#> 8 AK 2010 C Y 0
#> 9 AK 2010 C Z 1
Created on 2022-10-26 with reprex v2.0.2
Data taken from question
data <- structure(list(state = c("AK", "AK", "AK", "AK"), year = c(2010L,
2010L, 2010L, 2010L), cat1 = c("A", "B", "A", "C"), cat2 = c("X",
"Z", "Y", "Z")), class = "data.frame", row.names = c(NA, -4L))