Home > Back-end >  Prop.test on count data with multi-level factors in R
Prop.test on count data with multi-level factors in R

Time:02-10

I'm having trouble figuring out if prop.test can run this much data (See below) or if I would need to run the code for each level of the factor "Zone" individually. So far I've seen many examples written in this format, but I have many more factor levels:

# Whipray "Zone"
prop.test(c(4,4,0), c(9,7,15))

I'd like to know if there are statistically different proportions of fish with empty stomachs over the years and between locations all at once, i.e test the null that there are no differences at any point in time or place (along with a pairwise test to see where those differences are, if they exist).

Count of fish with empty stomachs

> table1 <- xtabs(empty_count ~ Zone   Year, data = df)
> table1
           Year
Zone        2016 2017 2018
  Crocodile    0    8    2
  Rankin       3   17    8
  West         7   31   17
  Whipray      4    4    0

Count of all fish caught

> table2 <- xtabs(total_count ~ Zone   Year, data = df)
> table2
           Year
Zone        2016 2017 2018
  Crocodile    1   18    7
  Rankin      14   46   69
  West        29   67   58
  Whipray      9    7   15

CodePudding user response:

I think I've managed to reverse-engineer your original data frame from the cross-tabs:

df
#>         Zone Year total_count empty_count
#> 1  Crocodile 2016           1           0
#> 2     Rankin 2016          14           3
#> 3       West 2016          29           7
#> 4    Whipray 2016           9           4
#> 5  Crocodile 2017          18           8
#> 6     Rankin 2017          46          17
#> 7       West 2017          67          31
#> 8    Whipray 2017           7           4
#> 9  Crocodile 2018           7           2
#> 10    Rankin 2018          69           8
#> 11      West 2018          58          17
#> 12   Whipray 2018          15           0

It seems to me that rather than attempting multiple pairwise comparisons you could carry out a single logistic regression to find out where the significant differences are. Just ensure you have a "non-empty" count column, and that your years are factors:

df$non_empty_count <- df$total_count - df$empty_count
df$Year <- as.factor(df$Year)

And your logistic regression looks like this:

model <- glm(cbind(empty_count, non_empty_count) ~ Zone   Year, 
             data = df, family = binomial)

summary(model)
#> 
#> Call:
#> glm(formula = cbind(empty_count, non_empty_count) ~ Zone   Year, 
#>     family = binomial, data = df)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.47563  -0.59548   0.04512   0.56564   1.26509  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)  -0.9588     0.5360  -1.789   0.0737 .
#> ZoneRankin   -0.4864     0.4740  -1.026   0.3048  
#> ZoneWest      0.1281     0.4552   0.281   0.7784  
#> ZoneWhipray  -0.1395     0.6038  -0.231   0.8173  
#> Year2017      0.7975     0.3659   2.180   0.0293 *
#> Year2018     -0.3861     0.3831  -1.008   0.3136  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 39.255  on 11  degrees of freedom
#> Residual deviance: 11.800  on  6  degrees of freedom
#> AIC: 57.905
#> 
#> Number of Fisher Scoring iterations: 5

Which you can interpret as saying that while the proportion of empty stomachs didn't vary significantly between sites, there was a significantly higher proportion of fish with empty stomachs in 2017 compared to other years across all sites.


Reproducible data

df <- structure(list(Zone = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L), .Label = c("Crocodile", "Rankin", "West", 
"Whipray"), class = "factor"), Year = c(2016, 2016, 2016, 2016, 
2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018), total_count = c(1L, 
14L, 29L, 9L, 18L, 46L, 67L, 7L, 7L, 69L, 58L, 15L), empty_count = c(0L, 
3L, 7L, 4L, 8L, 17L, 31L, 4L, 2L, 8L, 17L, 0L)), row.names = c(NA, 
-12L), class = "data.frame")

Created on 2022-02-09 by the reprex package (v2.0.1)

  • Related