Chisq.test with group_by

Time:11-23

I have a table (already a crosstab), which contains several variables with their categories. I want to run chi-squared tests for each indicator (without looping over the table, if possible). An important thing is that each indicator should be tested separately from other indicators.

Here is my dataset

table2 <- data.frame(
  indicator = c("A_3_respondent_hohh", "A_3_respondent_hohh",
                "B_2_hh_hosting_displaced_persons", "B_2_hh_hosting_displaced_persons",
                "o7_current_settlement_type", "o7_current_settlement_type", "o7_current_settlement_type"),
  var = c("no", "yes", "no", "yes", "city_smt", "regional_center", "village"),
  n_female = c(3L, 3L, 5L, 1L, 3L, 1L, 2L),
  n_male = c(NA, 9L, 6L, 3L, 3L, 6L, NA),
  n_Overall = c(3L, 12L, 11L, 4L, 6L, 7L, 2L)
)

I use the following approach to handle it

tbl2 <- table2 |>
  select(indicator, var, starts_with("n_"), -c(n_Overall)) |>
  mutate(across(starts_with("n_"), ~replace_na(.x, 0))) |>
  group_by(indicator) |>
  mutate(var  = as.factor(var)) |>
  summarise(chi_st = chisq.test(n_male, n_female)$statistic,
            chi_p = chisq.test(n_male, n_female)$p.value)

It throws an error on the "A_3_respondent_hohh" indicator

Error in `summarise()`:
! Problem while computing `chi_st = chisq.test(n_male, n_female)$statistic`.
ℹ The error occurred in group 1: indicator = "A_3_respondent_hohh".
Caused by error in `chisq.test()`:
! 'x' and 'y' must have at least 2 levels

But it works when I run the test on this indicator alone

t2 <- table2 |> filter(indicator == "A_3_respondent_hohh") |>
  select(indicator, var, starts_with("n_"), -c(n_Overall)) |>
  mutate(across(starts_with("n_"), ~replace_na(.x, 0)))
chisq.test(t2[, c("n_male", "n_female")])

Thus, the question is how the issue with grouped indicators could be resolved.

Also, strangely, when I run the test on the rest of the indicators within summarize() and individually, the test statistic slightly varies. Why could it happen?

P.S. I know that Fisher's exact test would be a better solution in this case, but I need a Chi-squared test here as well.

CodePudding user response:

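In your second example you pass a data.frame with just two columns, n_male and n_female, to chisq.test(), which treats it as a contingency table of counts. In your first example, inside group_by(), you pass both arguments x and y instead. In that form chisq.test() coerces each vector to a factor and cross-tabulates them, so the counts are treated as category labels. For "A_3_respondent_hohh", n_female is c(3, 3), which has only one level, hence the error. This is also why the test statistic for the other indicators differs between the two approaches. We can pass the count columns as a single table within group_by() as well: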

library(tidyverse)

table2 |>
  select(indicator, var, starts_with("n_"), -c(n_Overall)) |>
  mutate(across(starts_with("n_"), ~replace_na(.x, 0))) |>
  group_by(indicator) |>
  mutate(var  = as.factor(var)) |>
  # we first save the result wrapped in `list()`:
  summarise(chi_sq = list(chisq.test(across(starts_with("n_"))))) |>
  # now we can use `rowwise()` and add elements of the test to the data.frame in `mutate()` call:
  rowwise() |>
  mutate(chi_st = chi_sq$statistic,
         chi_p  = chi_sq$p.value)

#> Warning in chisq.test(across(starts_with("n_"))): Chi-squared approximation may
#> be incorrect

#> Warning in chisq.test(across(starts_with("n_"))): Chi-squared approximation may
#> be incorrect

#> Warning in chisq.test(across(starts_with("n_"))): Chi-squared approximation may
#> be incorrect
#> # A tibble: 3 × 4
#> # Rowwise: 
#>   indicator                        chi_sq  chi_st  chi_p
#>   <chr>                            <list>   <dbl>  <dbl>
#> 1 A_3_respondent_hohh              <htest> 2.93   0.0867
#> 2 B_2_hh_hosting_displaced_persons <htest> 0.0142 0.905 
#> 3 o7_current_settlement_type       <htest> 5.18   0.0751

Created on 2022-11-22 with reprex v2.0.2
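
To see the difference between the two call forms in isolation, here is a minimal base R sketch using the counts of two indicators from the question (with NA already replaced by 0). The variable names (a_male, res_b_vec, and so on) are made up for illustration and do not appear in the original code.

```r
# Counts for two indicators from the question (NA already replaced by 0).
a_male <- c(0, 9); a_female <- c(3, 3)   # A_3_respondent_hohh
b_male <- c(6, 3); b_female <- c(5, 1)   # B_2_hh_hosting_displaced_persons

# Two-vector form: x and y are coerced to factors and cross-tabulated,
# so the numbers act as category labels rather than counts.
table(b_male, b_female)                  # 2 x 2 table built from only 2 "observations"
res_b_vec <- suppressWarnings(chisq.test(b_male, b_female))

# a_female has a single distinct value, i.e. one factor level, so this call fails.
res_a_vec <- try(chisq.test(a_male, a_female), silent = TRUE)

# Matrix form: the cells are used directly as observed counts.
res_a_mat <- suppressWarnings(chisq.test(cbind(a_male, a_female)))
res_b_mat <- suppressWarnings(chisq.test(cbind(b_male, b_female)))

# Compare the statistics from the two forms for the same indicator.
c(vector_form = unname(res_b_vec$statistic),
  matrix_form = unname(res_b_mat$statistic))
inherits(res_a_vec, "try-error")         # TRUE: the two-vector call errored
unname(res_a_mat$statistic)
```

The matrix-form statistics should agree with the chi_st column in the output above (about 2.93 and 0.0142), while the two-vector form tests a different table altogether.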
