I have a dataset of only categorical data formatted as factors. We'll call it "mydata".
I want to create a dataframe or tibble of nested contingency tables for all combinations of factors in mydata for review in exploratory data analysis and to easily pass to chisq.test(). I envision this looking like:
| na | var1 | var2 | var3 |
| var1| tibble | tibble | tibble |
| var2| tibble | tibble | tibble |
| var3| tibble | tibble | tibble |
I've tried a few different attempts with dplyr::nest() and purr::map2(). My preference is to keep this tidy if possible.
The closes I've come is below.
mydata <- tibble(var1 = factor(c("a", "b", "c", "c", "b")),
var2 = factor(c("Yes", "No", "Yes", "Yes", "No")),
var3 = factor(c(1, 1, 1, 2, 2))) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "measure") %>%
nest_by(variable) %>%
mutate(test_map = map2(data, data, table))
I feel like this is almost there but the resulting contingency tables include all factor levels instead of just the factor levels for the corresponding target variables (e.g. var1:var1, var1:var2, etc.)
mydata$test_map
$measure
a b c No Yes 1 2
a 1 0 0 0 0 0 0
b 0 2 0 0 0 0 0
c 0 0 2 0 0 0 0
No 0 0 0 0 0 0 0
Yes 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
$measure
a b c No Yes 1 2
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0
No 0 0 0 2 0 0 0
Yes 0 0 0 0 3 0 0
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
$measure
a b c No Yes 1 2
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0
No 0 0 0 0 0 0 0
Yes 0 0 0 0 0 0 0
1 0 0 0 0 0 3 0
2 0 0 0 0 0 0 2
CodePudding user response:
How about something like this:
library(tidyverse)
mydata <- tibble(var1 = factor(c("a", "b", "c", "c", "b")),
var2 = factor(c("Yes", "No", "Yes", "Yes", "No")),
var3 = factor(c(1, 1, 1, 2, 2)))
result <- full_join(tibble(name1 = colnames(mydata), id =1),
tibble(name2 = colnames(mydata), id =1),
by = "id") |>
mutate(tbl = map2(name1, name2, ~table(mydata[[.x]], mydata[[.y]]))) |>
select(-id) |>
pivot_wider(names_from = name2, values_from = tbl)
result
#> # A tibble: 3 x 4
#> name1 var1 var2 var3
#> <chr> <list> <list> <list>
#> 1 var1 <table [3 x 3]> <table [3 x 2]> <table [3 x 2]>
#> 2 var2 <table [2 x 3]> <table [2 x 2]> <table [2 x 2]>
#> 3 var3 <table [2 x 3]> <table [2 x 2]> <table [2 x 2]>
result$var1[1]
#> [[1]]
#>
#> a b c
#> a 1 0 0
#> b 0 2 0
#> c 0 0 2
CodePudding user response:
If you create a matrix of cross reference variables taken by pairs you're going to duplicate info. Also the diagonal tables are diagonals.
For example:
set.seed(42)
df = data.frame(var1= sample(c("yes","no"),50,T),
var2 = sample(c("big", "med", "small"), 50, T),
var3 = colors()[sample(4, 50, T)],
var4 = letters[sample(5,50,T)])
head(df)
#> var1 var2 var3 var4
#> 1 yes big white e
#> 2 yes med aliceblue a
#> 3 yes med antiquewhite a
#> 4 yes small antiquewhite1 b
#> 5 no small antiquewhite1 b
#> 6 no big aliceblue d
If you do not want diagonals and duplicated tables, you can create a list for each combination with combn
table
function is what creates the corssreference tables.
l <- combn(names(df),2, function(x)
as.data.frame(unclass(table(df[,x]))), simplify = F)
names(l) <- combn(names(df),2, paste, collapse="-", simplify = F)
l
#> $`var1-var2`
#> big med small
#> no 7 14 6
#> yes 12 7 4
#>
#> $`var1-var3`
#> aliceblue antiquewhite antiquewhite1 white
#> no 8 5 7 7
#> yes 4 6 3 10
#>
#> $`var1-var4`
#> a b c d e
#> no 6 5 5 8 3
#> yes 4 9 3 4 3
#>
#> $`var2-var3`
#> aliceblue antiquewhite antiquewhite1 white
#> big 7 5 2 5
#> med 3 3 6 9
#> small 2 3 2 3
#>
#> $`var2-var4`
#> a b c d e
#> big 1 7 4 4 3
#> med 5 4 2 7 3
#> small 4 3 2 1 0
#>
#> $`var3-var4`
#> a b c d e
#> aliceblue 5 1 2 3 1
#> antiquewhite 3 2 3 1 2
#> antiquewhite1 0 3 1 6 0
#> white 2 8 2 2 3
Nevertheless if you want the full matrix you can use apply
with expand.grid
. Tables in the lower triangle will be the same but transposed:
l <- apply(expand.grid(names(df),names(df)),1,function(x)
as.data.frame(unclass(table(df[,x]))), simplify = F)
names(l) <- apply(expand.grid(names(df),names(df)), 1, paste, collapse = "-")
l
l
#> $`var1-var1`
#> no yes
#> no 27 0
#> yes 0 23
#>
#> $`var2-var1`
#> no yes
#> big 7 12
#> med 14 7
#> small 6 4 ......
Finally if you want nested tibbles you can do:
l <- apply(expand.grid(names(df),names(df)),1,function(x)
tibble::as.tibble(unclass(table(df[,x]))), simplify = F)
tbl <- tibble::tibble(!!!split(l, rep(1:4,4)))
rownames(tbl)<-names(df)
#> Warning: Setting row names on a tibble is deprecated.
colnames(tbl)<-names(df)
tbl
#> # A tibble: 4 × 4
#> var1 var2 var3 var4
#> * <named list> <named list> <named list> <named list>
#> 1 <df [2 × 2]> <df [3 × 2]> <df [4 × 2]> <df [5 × 2]>
#> 2 <df [2 × 3]> <df [3 × 3]> <df [4 × 3]> <df [5 × 3]>
#> 3 <df [2 × 4]> <df [3 × 4]> <df [4 × 4]> <df [5 × 4]>
#> 4 <df [2 × 5]> <df [3 × 5]> <df [4 × 5]> <df [5 × 5]>
But row names appears to be not allowed anymore.