Chi-square test in R using dplyr

I would like to perform a chi-square test in R using dplyr. Specifically, I would like to investigate whether customer churn differs between male and female customers. Here is a short example of my data.

  sex   churn 
  <fct> <lgl>
 1 W     FALSE
 2 W     FALSE
 3 W     FALSE
 4 W     FALSE
 5 W     FALSE
 6 W     FALSE 
 7 W     FALSE
 8 W     FALSE
 9 W     FALSE
10 W     FALSE
11 W     FALSE
12 W     FALSE
13 M     FALSE
14 W     FALSE
15 W     FALSE
16 W     FALSE
17 W     FALSE
18 M     FALSE
19 W     FALSE
20 W     TRUE 
21 W     TRUE 
22 M     FALSE
23 M     FALSE
24 W     TRUE 
25 W     FALSE

With the summarise and spread functions I already get a nice summary table.

churn_latest %>%
    group_by(sex, churn) %>%
    summarise(n = n()) %>%
    spread(key = sex, value = n)

Now I would like to apply a chi-square test to it, but I always get the following error: 'x' and 'y' must have at least 2 levels. Both variables do have two levels in my data, so I must have an error in the syntax.

churn_latest %>%
    group_by(sex, churn) %>%
    summarise(chi = chisq.test(sex, churn))

I would be very happy if someone had a solution to my problem. Many thanks in advance!

Answer:

You’ll first need to produce a contingency table from your data, which you can then pass to chisq.test. To produce the contingency table using ‘dplyr’ & ‘tidyr’ you can use

churn_latest %>%
    count(sex, churn) %>%
    pivot_wider(names_from = sex, values_from = n, values_fill = 0L)

# A tibble: 2 × 3
  churn     M     W
  <lgl> <int> <int>
1 FALSE     4    18
2 TRUE      0     3
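
As an aside on the original error: group_by(sex, churn) splits the data so that every group holds a single value of sex and a single value of churn, which is most likely why chisq.test complains about fewer than 2 levels. A minimal sketch of that situation in base R, where the two vectors are made-up stand-ins for one such group (they are not taken from churn_latest):

```r
# Hypothetical stand-in for one group produced by group_by(sex, churn):
# only women, none of whom churned.
sex_in_group   <- factor(c("W", "W", "W"), levels = c("M", "W"))
churn_in_group <- c(FALSE, FALSE, FALSE)

# chisq.test() re-factors both inputs, which drops the unused level "M",
# so each variable ends up with a single level and the call errors.
outcome <- tryCatch(
    chisq.test(sex_in_group, churn_in_group),
    error = function(e) conditionMessage(e)
)

# `outcome` now holds the error message instead of a test result.
print(outcome)
```

This is why the test has to see the whole table at once rather than one group at a time.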

Next, you need to convert this into a matrix (dropping the key column):

… %>%
    select(-churn) %>%
    as.matrix()

     M  W
[1,] 4 18
[2,] 0  3

And that, finally, can be passed to chisq.test. Putting it all together:

churn_latest %>%
    count(sex, churn) %>%
    pivot_wider(names_from = sex, values_from = n, values_fill = 0L) %>%
    select(-churn) %>%
    as.matrix() %>%
    chisq.test()
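
For anyone who wants to run the pipeline end to end, here is a sketch that rebuilds the 25 sample rows from the question first. The way the data is constructed (row positions of the men and of the churned customers) is read off the printed sample, and the names counts and result are only illustrative:

```r
library(dplyr)
library(tidyr)

# Rebuild the 25 sample rows shown in the question:
# rows 13, 18, 22, 23 are men; rows 20, 21, 24 churned.
sex <- rep("W", 25)
sex[c(13, 18, 22, 23)] <- "M"
churn <- rep(FALSE, 25)
churn[c(20, 21, 24)] <- TRUE
churn_latest <- tibble(sex = factor(sex), churn = churn)

# Contingency table as a plain integer matrix (columns M and W).
counts <- churn_latest %>%
    count(sex, churn) %>%
    pivot_wider(names_from = sex, values_from = n, values_fill = 0L) %>%
    select(-churn) %>%
    as.matrix()

# With counts this small, R is expected to warn that the chi-squared
# approximation may be incorrect (some expected counts are below 5).
result <- chisq.test(counts)
print(result)
```

On the full data set the counts will be larger, so the warning about the approximation may not appear there.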

… to be fair, using ‘dplyr’ and ‘tidyr’ here is a bit overkill. Base R’s table does the same much more concisely:

churn_latest %>%
    table() %>%
    chisq.test()
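
One caveat with the short version: table() cross-tabulates every column of the data frame, so it only gives a 2 x 2 table when churn_latest contains nothing but sex and churn. If the real data has more columns, a sketch along these lines, which names the two columns explicitly, may be safer (the data is again rebuilt from the sample in the question, and tab and result are illustrative names):

```r
# Rebuild the 25 sample rows shown in the question (base R only).
sex <- rep("W", 25)
sex[c(13, 18, 22, 23)] <- "M"
churn <- rep(FALSE, 25)
churn[c(20, 21, 24)] <- TRUE
churn_latest <- data.frame(sex = factor(sex), churn = churn)

# Cross-tabulate only the two columns of interest.
tab <- table(sex = churn_latest$sex, churn = churn_latest$churn)
print(tab)

# Test on the table; R is expected to warn here because the sample
# has expected counts below 5.
result <- chisq.test(tab)

# chisq.test() also accepts the two vectors directly and builds the
# same table internally:
# chisq.test(churn_latest$sex, churn_latest$churn)
print(result)
```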