Home > OS >  `Tidy` solution to getting all combinations of two factors including NAs
`Tidy` solution to getting all combinations of two factors including NAs

Time:07-09

Say I have a dataframe df

df = data.frame(
  a = c(1978, 1978, 1978, 1978, 1978, 1978,
        1979, 1979, 1979, 1980, 1980),
  b = c("A", "A", "A", "B", "B", "C", "A", "B", "C", "A", "C")
)

And I want the number of times each combination of a and b exist in the dataframe. I can do this,

test = df %>% 
  group_by(a, b) %>% 
  summarize(counts = n())

which gives

      a b counts
1  1978 A      3
2  1978 A      3
3  1978 A      3
4  1978 B      2
5  1978 B      2
6  1978 C      1
7  1979 A      1
8  1979 B      1
9  1979 C      1
10 1980 A      1
11 1980 C      1

However, I may want to include the zeros (i.e. the cases where there is no combination, such as a = 1980 and b = B. I can obviously do this with some ugly loops and if-statements, but I'm sure there's a dplyr solution I'm just not thinking of.

Any help much appreciated!

CodePudding user response:

Found the answer right after posting - I don't even need dplyr, table() works just fine:

> data.frame(table(df))
     a b Freq
1 1978 A    3
2 1979 A    1
3 1980 A    1
4 1978 B    2
5 1979 B    1
6 1980 B    0
7 1978 C    1
8 1979 C    1
9 1980 C    1

And if you have a dataframe with more columns you don't want to include like

df = data.frame(
  a = c(1978, 1978, 1978, 1978, 1978, 1978,
        1979, 1979, 1979, 1980, 1980),
  b = c("A", "A", "A", "B", "B", "C", "A", "B", "C", "A", "C"),
  c = c("x", "x", "y", "x", "a", "x", "yc", "a", "c", "b", "z")
)

Then you can still use the same solution, just specify which columns:

> data.frame(table(df[,c("a", "b")]))
     a b Freq
1 1978 A    3
2 1979 A    1
3 1980 A    1
4 1978 B    2
5 1979 B    1
6 1980 B    0
7 1978 C    1
8 1979 C    1
9 1980 C    1

CodePudding user response:

An dplyr option with complete from tidyr:

df = data.frame(
  a = c(1978, 1978, 1978, 1978, 1978, 1978,
        1979, 1979, 1979, 1980, 1980),
  b = c("A", "A", "A", "B", "B", "C", "A", "B", "C", "A", "C")
)

library(dplyr)
library(tidyr)
df %>% 
  group_by(a,b) %>% 
  summarise(counts = n()) %>%
  ungroup() %>%
  complete(a, b, fill = list(counts = 0))
#> `summarise()` has grouped output by 'a'. You can override using the `.groups`
#> argument.
#> # A tibble: 9 × 3
#>       a b     counts
#>   <dbl> <chr>  <int>
#> 1  1978 A          3
#> 2  1978 B          2
#> 3  1978 C          1
#> 4  1979 A          1
#> 5  1979 B          1
#> 6  1979 C          1
#> 7  1980 A          1
#> 8  1980 B          0
#> 9  1980 C          1

Created on 2022-07-08 by the reprex package (v2.0.1)

CodePudding user response:

With a data.table approach:

library(data.table)

res <- setDT(df)[, counts := .N, .(a,b)][
  CJ(a = a, b = b, unique = T), on=.(a, b)]

setnafill(unique(res), fill = 0, cols = "counts")[]

#>       a b counts
#> 1: 1978 A      3
#> 2: 1978 B      2
#> 3: 1978 C      1
#> 4: 1979 A      1
#> 5: 1979 B      1
#> 6: 1979 C      1
#> 7: 1980 A      1
#> 8: 1980 B      0
#> 9: 1980 C      1
  • Related