How do I perform chi square tests between many variables and create a data frame of the results?

Time:01-03

I'm still new to R and data analytics in general. I have a data set containing 2 parts:

  1. 20 questions (answered on a 5-point Likert scale)
  2. 8 socio-demographic variables

Here is a scaled down sample version of the data set (only contains 3 of the 20 questions and 3 socio-demographic variables) in case it is needed:

df <- data.frame(Q1 = c(1, 2, 2, 1, 3, 4, 3, 5, 2, 2),
                 Q2 = c(2, 3, 5, 5, 4, 5, 1, 1, 5, 3),
                 Q3 = c(4, 4, 2, 3, 2, 1, 1, 1, 5, 5),
                 ageRange = c(2, 3, 1, 1, 3, 4, 4, 2, 1, 1),
                 education = c(1, 1, 3, 4, 6, 5, 3, 2, 1, 4),
                 maritalStatus = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1))

  1. I need to apply a chi square test that relates each question to all the socio demographic variables. That would be a total of 9 chi square results: Q1 - ageRange, Q1 - education, Q1 - maritalStatus, Q2 - ageRange, Q2 - education, Q2 - maritalStatus, Q3 - ageRange, Q3 - education, Q3 - maritalStatus
  2. I want to arrange the results of the chi square pairings into a data frame or matrix where the columns would be the 3 socio demographic factors and the rows would be the 3 questions. It should look something like this (just replace each 0 with the corresponding p-values for each of the row and column pairs):
data.frame(Age = c(0, 0, 0),
           Education = c(0, 0, 0), 
           Married = c(0, 0, 0), row.names = c("Q1", "Q2", "Q3")) 

I tried using some of the apply functions, but I could not get it to work.

CodePudding user response:

We could do something like this. It is quite verbose, but it may help as a starting point.

In principle, for each Q column we build a new data frame that holds that question together with the socio-demographic variables, run the tests on it, and bind the three results together at the end.

The tidy() function from the broom package is quite handy here, since it turns each test result into a one-row data frame:

library(dplyr)
library(tidyr)
library(purrr)   # provides map(), which the code below uses
library(broom)

# Q1 against every socio-demographic variable
Q1 <- df %>% 
  select(-Q2, -Q3) %>%      # keep Q1 and the demographics
  pivot_longer(-Q1) %>%     # long format: Q1, name, value
  nest(data = -name) %>%    # one nested data frame per demographic
  mutate(stats = map(data, ~broom::tidy(chisq.test(.$Q1, .$value)))) %>% 
  select(-data) %>% 
  unnest(c(stats))

# Same steps for Q2
Q2 <- df %>% 
  select(-Q1, -Q3) %>% 
  pivot_longer(-Q2) %>% 
  nest(data = -name) %>% 
  mutate(stats = map(data, ~broom::tidy(chisq.test(.$Q2, .$value)))) %>% 
  select(-data) %>% 
  unnest(c(stats))

# Same steps for Q3
Q3 <- df %>% 
  select(-Q1, -Q2) %>% 
  pivot_longer(-Q3) %>% 
  nest(data = -name) %>% 
  mutate(stats = map(data, ~broom::tidy(chisq.test(.$Q3, .$value)))) %>% 
  select(-data) %>% 
  unnest(c(stats))

# Stack the three results; .id numbers them 1 to 3, which becomes the ID column
bind_rows(Q1, Q2, Q3, .id = "Q") %>% 
  mutate(ID = paste0("Q", Q), .before = 1, .keep = "unused")

-output

  ID    name          statistic p.value parameter method                    
  <chr> <chr>             <dbl>   <dbl>     <int> <chr>                     
1 Q1    ageRange          15.6    0.209        12 Pearson's Chi-squared test
2 Q1    education         27.5    0.122        20 Pearson's Chi-squared test
3 Q1    maritalStatus      2.71   0.608         4 Pearson's Chi-squared test
4 Q2    ageRange          15.6    0.209        12 Pearson's Chi-squared test
5 Q2    education         20.8    0.407        20 Pearson's Chi-squared test
6 Q2    maritalStatus      2.71   0.608         4 Pearson's Chi-squared test
7 Q3    ageRange          14.6    0.265        12 Pearson's Chi-squared test
8 Q3    education         21.7    0.359        20 Pearson's Chi-squared test
9 Q3    maritalStatus      3.06   0.549         4 Pearson's Chi-squared test
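
If only the p-values are needed, in the questions-by-demographics layout the question asks for, a base R alternative is to nest two sapply() calls. This is a minimal sketch rather than part of the pipeline above: the names questions, demographics, p_matrix and p_df are made up for illustration, and suppressWarnings() is added on the assumption that the small-sample warning from chisq.test() is not wanted in the output.

```r
# Sample data from the question, assigned to `df`
df <- data.frame(Q1 = c(1, 2, 2, 1, 3, 4, 3, 5, 2, 2),
                 Q2 = c(2, 3, 5, 5, 4, 5, 1, 1, 5, 3),
                 Q3 = c(4, 4, 2, 3, 2, 1, 1, 1, 5, 5),
                 ageRange = c(2, 3, 1, 1, 3, 4, 4, 2, 1, 1),
                 education = c(1, 1, 3, 4, 6, 5, 3, 2, 1, 4),
                 maritalStatus = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1))

# Hypothetical helper vectors naming the two groups of columns
questions    <- c("Q1", "Q2", "Q3")
demographics <- c("ageRange", "education", "maritalStatus")

# Outer sapply() builds one column per demographic,
# inner sapply() builds one row per question.
p_matrix <- sapply(demographics, function(d) {
  sapply(questions, function(q) {
    suppressWarnings(chisq.test(df[[q]], df[[d]]))$p.value
  })
})

# Data frame version: rows Q1..Q3, columns ageRange, education, maritalStatus
p_df <- as.data.frame(p_matrix)
p_df
```

With the full data set, questions and demographics would list all 20 question columns and all 8 socio-demographic columns.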

CodePudding user response:

We may also loop over the question columns with purrr::map():

library(purrr)
library(broom)
library(tidyr)
library(stringr)
library(dplyr)
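
# Pick the question columns (Q followed by one or more digits)
str_subset(names(df), "^Q\\d+$") %>%
  map(~ df %>% 
    select(all_of(.x), ageRange:maritalStatus) %>%
    pivot_longer(cols = -1) %>%          # keep the question column, stack the rest
    group_by(ID = .x, name) %>% 
    summarise(stats = tidy(chisq.test(cur_data()[[1]], value)),
              .groups = "drop")) %>% 
  list_rbind() %>%                       # bind the per-question results
  unnest(where(is_tibble))               # unpack the tidy() columns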

-output

# A tibble: 9 × 6
  ID    name          statistic p.value parameter method                    
  <chr> <chr>             <dbl>   <dbl>     <int> <chr>                     
1 Q1    ageRange          15.6    0.209        12 Pearson's Chi-squared test
2 Q1    education         27.5    0.122        20 Pearson's Chi-squared test
3 Q1    maritalStatus      2.71   0.608         4 Pearson's Chi-squared test
4 Q2    ageRange          15.6    0.209        12 Pearson's Chi-squared test
5 Q2    education         20.8    0.407        20 Pearson's Chi-squared test
6 Q2    maritalStatus      2.71   0.608         4 Pearson's Chi-squared test
7 Q3    ageRange          14.6    0.265        12 Pearson's Chi-squared test
8 Q3    education         21.7    0.359        20 Pearson's Chi-squared test
9 Q3    maritalStatus      3.06   0.549         4 Pearson's Chi-squared test
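
To get from this long result to the layout requested in the question (questions as rows, socio-demographic variables as columns, p-values in the cells), tidyr::pivot_wider() may be enough. The sketch below assumes the long result is stored under the hypothetical name res; so that it runs on its own, res is rebuilt here in a compact way with expand_grid() and map2_dbl(), keeping only the p-value. The names res and p_wide, and the use of suppressWarnings(), are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample data from the question, assigned to `df`
df <- data.frame(Q1 = c(1, 2, 2, 1, 3, 4, 3, 5, 2, 2),
                 Q2 = c(2, 3, 5, 5, 4, 5, 1, 1, 5, 3),
                 Q3 = c(4, 4, 2, 3, 2, 1, 1, 1, 5, 5),
                 ageRange = c(2, 3, 1, 1, 3, 4, 4, 2, 1, 1),
                 education = c(1, 1, 3, 4, 6, 5, 3, 2, 1, 4),
                 maritalStatus = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1))

# Long table: one row per (question, demographic) pair with its p-value
res <- expand_grid(ID   = c("Q1", "Q2", "Q3"),
                   name = c("ageRange", "education", "maritalStatus")) %>%
  mutate(p.value = map2_dbl(
    ID, name,
    ~ suppressWarnings(chisq.test(df[[.x]], df[[.y]]))$p.value
  ))

# Wide table: one row per question, one column per demographic
p_wide <- res %>%
  select(ID, name, p.value) %>%
  pivot_wider(names_from = name, values_from = p.value)

p_wide
```

The same select() and pivot_wider() steps should apply unchanged to the nine-row tibble shown above, since it has the same ID, name and p.value columns.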