Home > OS >  Create contingency table that displays the frequency distribution of pairs of variables
Create contingency table that displays the frequency distribution of pairs of variables

Time:11-26

I want to create a contingency table that displays the frequency distribution of pairs of variables. Here is an example dataset:

mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")

All variables are binary with 1 indicating either the presence of specfic movie type or the male gender. In the end, I would like to have the table that counts the presence of different movie types under specific gender. Something like this:

           male female
Horror      1      1
Thriller    1      3
Comedy      2      2
Romantic    0      0
Sci.fi      2      0

I know I can create two tables of different movie types for male and female individually (see TarJae's answer here Create count table under specific condition) and cbind them later but I would like to do it in one chunk of code. How to achieve this in an efficient way?

CodePudding user response:

Here is a solution using dplyr and tidyr:

df %>% pivot_longer(cols = -gender, names_to = "type") %>%
  mutate(gender = fct_recode(as.character(gender),Male = "0",Female = "1")) %>% 
  group_by(gender,type) %>% 
  summarise(sum = sum(value)) %>% 
  pivot_wider(names_from = gender,values_from = sum)

Which gives

# A tibble: 5 x 3
  type      Male Female
  <chr>    <dbl>  <dbl>
1 Comedy       0      1
2 Horror       1      3
3 Romantic     1      1
4 Sci.fi       1      1
5 Thriller     1      1

The second line is optional but allows to get the levels for the variable gender.

CodePudding user response:

Please find below a reprex with an alternative solution using data.table and magrittr (for the pipes), also in one chunk.

Reprex

  • Your data (I set a seed for reproducibility)
set.seed(452)
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
df
#>   Horror Thriller Comedy Romantic Sci.fi gender
#> 1      0        1      1        0      0      0
#> 2      0        0      0        0      1      0
#> 3      1        0      1        1      0      1
#> 4      0        1      0        0      0      1
#> 5      0        1      0        0      0      1
  • Code in one chunk
library(data.table)
library(magrittr) # for the pipes!

df %>% 
  transpose(., keep.names = "rn") %>% 
  setDT(.) %>% 
  {.[,  .(rn = rn,
         male = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 1]),
         female = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 0]))][rn !="gender"]}
  • Output
#>          rn male female
#> 1:   Horror    1      0
#> 2: Thriller    2      1
#> 3:   Comedy    1      1
#> 4: Romantic    1      0
#> 5:   Sci.fi    0      1

Created on 2021-11-25 by the reprex package (v2.0.1)

CodePudding user response:

You could do

sapply(split(df, df$gender), function(x) colSums(x[names(x)!="gender"]))    

#>          0 1
#> Horror   1 1
#> Thriller 1 3
#> Comedy   0 0
#> Romantic 0 0
#> Sci.fi   1 3
  • Related