Home > Back-end >  Creating a matrix of nested contingency tables with purr::map
Creating a matrix of nested contingency tables with purr::map

Time:11-05

I have a dataset of only categorical data formatted as factors. We'll call it "mydata".

I want to create a dataframe or tibble of nested contingency tables for all combinations of factors in mydata for review in exploratory data analysis and to easily pass to chisq.test(). I envision this looking like:

| na  |   var1   |   var2   |   var3   |  
| var1| tibble   | tibble   | tibble   |  
| var2| tibble   | tibble   | tibble   |  
| var3| tibble   | tibble   | tibble   |  

I've tried a few different attempts with dplyr::nest() and purr::map2(). My preference is to keep this tidy if possible.

The closes I've come is below.

mydata <- tibble(var1 = factor(c("a", "b", "c", "c", "b")),
               var2 = factor(c("Yes", "No", "Yes", "Yes", "No")),
               var3 = factor(c(1, 1, 1, 2, 2))) %>%
pivot_longer(cols = everything(),
             names_to = "variable",
             values_to = "measure") %>%
nest_by(variable) %>%
mutate(test_map = map2(data, data, table))

I feel like this is almost there but the resulting contingency tables include all factor levels instead of just the factor levels for the corresponding target variables (e.g. var1:var1, var1:var2, etc.)

   mydata$test_map
    $measure
     
      a b c No Yes 1 2
  a   1 0 0  0   0 0 0
  b   0 2 0  0   0 0 0
  c   0 0 2  0   0 0 0
  No  0 0 0  0   0 0 0
  Yes 0 0 0  0   0 0 0
  1   0 0 0  0   0 0 0
  2   0 0 0  0   0 0 0

$measure
     
      a b c No Yes 1 2
  a   0 0 0  0   0 0 0
  b   0 0 0  0   0 0 0
  c   0 0 0  0   0 0 0
  No  0 0 0  2   0 0 0
  Yes 0 0 0  0   3 0 0
  1   0 0 0  0   0 0 0
  2   0 0 0  0   0 0 0

$measure
     
      a b c No Yes 1 2
  a   0 0 0  0   0 0 0
  b   0 0 0  0   0 0 0
  c   0 0 0  0   0 0 0
  No  0 0 0  0   0 0 0
  Yes 0 0 0  0   0 0 0
  1   0 0 0  0   0 3 0
  2   0 0 0  0   0 0 2
    

CodePudding user response:

How about something like this:

library(tidyverse)

mydata <- tibble(var1 = factor(c("a", "b", "c", "c", "b")),
               var2 = factor(c("Yes", "No", "Yes", "Yes", "No")),
               var3 = factor(c(1, 1, 1, 2, 2)))


result <- full_join(tibble(name1 = colnames(mydata), id =1),
          tibble(name2 = colnames(mydata), id =1),
          by = "id") |>
  mutate(tbl = map2(name1, name2, ~table(mydata[[.x]], mydata[[.y]]))) |>
  select(-id) |>
  pivot_wider(names_from = name2, values_from = tbl)

result
#> # A tibble: 3 x 4
#>   name1 var1            var2            var3           
#>   <chr> <list>          <list>          <list>         
#> 1 var1  <table [3 x 3]> <table [3 x 2]> <table [3 x 2]>
#> 2 var2  <table [2 x 3]> <table [2 x 2]> <table [2 x 2]>
#> 3 var3  <table [2 x 3]> <table [2 x 2]> <table [2 x 2]>

result$var1[1]
#> [[1]]
#>    
#>     a b c
#>   a 1 0 0
#>   b 0 2 0
#>   c 0 0 2

CodePudding user response:

If you create a matrix of cross reference variables taken by pairs you're going to duplicate info. Also the diagonal tables are diagonals.

For example:

set.seed(42)

df = data.frame(var1= sample(c("yes","no"),50,T), 
                var2 = sample(c("big", "med", "small"), 50, T), 
                var3 = colors()[sample(4, 50, T)],
                var4 = letters[sample(5,50,T)])

head(df)
#>   var1  var2          var3 var4
#> 1  yes   big         white    e
#> 2  yes   med     aliceblue    a
#> 3  yes   med  antiquewhite    a
#> 4  yes small antiquewhite1    b
#> 5   no small antiquewhite1    b
#> 6   no   big     aliceblue    d

If you do not want diagonals and duplicated tables, you can create a list for each combination with combn table function is what creates the corssreference tables.

l <- combn(names(df),2, function(x) 
  as.data.frame(unclass(table(df[,x]))), simplify = F)

names(l) <- combn(names(df),2, paste, collapse="-", simplify = F)

l
#> $`var1-var2`
#>     big med small
#> no    7  14     6
#> yes  12   7     4
#> 
#> $`var1-var3`
#>     aliceblue antiquewhite antiquewhite1 white
#> no          8            5             7     7
#> yes         4            6             3    10
#> 
#> $`var1-var4`
#>     a b c d e
#> no  6 5 5 8 3
#> yes 4 9 3 4 3
#> 
#> $`var2-var3`
#>       aliceblue antiquewhite antiquewhite1 white
#> big           7            5             2     5
#> med           3            3             6     9
#> small         2            3             2     3
#> 
#> $`var2-var4`
#>       a b c d e
#> big   1 7 4 4 3
#> med   5 4 2 7 3
#> small 4 3 2 1 0
#> 
#> $`var3-var4`
#>               a b c d e
#> aliceblue     5 1 2 3 1
#> antiquewhite  3 2 3 1 2
#> antiquewhite1 0 3 1 6 0
#> white         2 8 2 2 3

Nevertheless if you want the full matrix you can use apply with expand.grid. Tables in the lower triangle will be the same but transposed:

l <- apply(expand.grid(names(df),names(df)),1,function(x) 
  as.data.frame(unclass(table(df[,x]))), simplify = F)

names(l) <- apply(expand.grid(names(df),names(df)), 1, paste, collapse = "-")

l
l
#> $`var1-var1`
#>     no yes
#> no  27   0
#> yes  0  23
#> 
#> $`var2-var1`
#>       no yes
#> big    7  12
#> med   14   7
#> small  6   4 ......

Finally if you want nested tibbles you can do:

l <- apply(expand.grid(names(df),names(df)),1,function(x) 
  tibble::as.tibble(unclass(table(df[,x]))), simplify = F)

tbl <- tibble::tibble(!!!split(l, rep(1:4,4)))
rownames(tbl)<-names(df)

#> Warning: Setting row names on a tibble is deprecated.
colnames(tbl)<-names(df)

tbl
#> # A tibble: 4 × 4
#>   var1         var2         var3         var4        
#> * <named list> <named list> <named list> <named list>
#> 1 <df [2 × 2]> <df [3 × 2]> <df [4 × 2]> <df [5 × 2]>
#> 2 <df [2 × 3]> <df [3 × 3]> <df [4 × 3]> <df [5 × 3]>
#> 3 <df [2 × 4]> <df [3 × 4]> <df [4 × 4]> <df [5 × 4]>
#> 4 <df [2 × 5]> <df [3 × 5]> <df [4 × 5]> <df [5 × 5]>

But row names appears to be not allowed anymore.

  • Related