I'm having trouble figuring out if prop.test can run this much data (See below) or if I would need to run the code for each level of the factor "Zone" individually. So far I've seen many examples written in this format, but I have many more factor levels:
# Whipray "Zone"
prop.test(c(4,4,0), c(9,7,15))
I'd like to know if there are statistically different proportions of fish with empty stomachs over the years and between locations all at once, i.e test the null that there are no differences at any point in time or place (along with a pairwise test to see where those differences are, if they exist).
Count of fish with empty stomachs
> table1 <- xtabs(empty_count ~ Zone Year, data = df)
> table1
Year
Zone 2016 2017 2018
Crocodile 0 8 2
Rankin 3 17 8
West 7 31 17
Whipray 4 4 0
Count of all fish caught
> table2 <- xtabs(total_count ~ Zone Year, data = df)
> table2
Year
Zone 2016 2017 2018
Crocodile 1 18 7
Rankin 14 46 69
West 29 67 58
Whipray 9 7 15
CodePudding user response:
I think I've managed to reverse-engineer your original data frame from the cross-tabs:
df
#> Zone Year total_count empty_count
#> 1 Crocodile 2016 1 0
#> 2 Rankin 2016 14 3
#> 3 West 2016 29 7
#> 4 Whipray 2016 9 4
#> 5 Crocodile 2017 18 8
#> 6 Rankin 2017 46 17
#> 7 West 2017 67 31
#> 8 Whipray 2017 7 4
#> 9 Crocodile 2018 7 2
#> 10 Rankin 2018 69 8
#> 11 West 2018 58 17
#> 12 Whipray 2018 15 0
It seems to me that rather than attempting multiple pairwise comparisons you could carry out a single logistic regression to find out where the significant differences are. Just ensure you have a "non-empty" count column, and that your years are factors:
df$non_empty_count <- df$total_count - df$empty_count
df$Year <- as.factor(df$Year)
And your logistic regression looks like this:
model <- glm(cbind(empty_count, non_empty_count) ~ Zone Year,
data = df, family = binomial)
summary(model)
#>
#> Call:
#> glm(formula = cbind(empty_count, non_empty_count) ~ Zone Year,
#> family = binomial, data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.47563 -0.59548 0.04512 0.56564 1.26509
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.9588 0.5360 -1.789 0.0737 .
#> ZoneRankin -0.4864 0.4740 -1.026 0.3048
#> ZoneWest 0.1281 0.4552 0.281 0.7784
#> ZoneWhipray -0.1395 0.6038 -0.231 0.8173
#> Year2017 0.7975 0.3659 2.180 0.0293 *
#> Year2018 -0.3861 0.3831 -1.008 0.3136
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 39.255 on 11 degrees of freedom
#> Residual deviance: 11.800 on 6 degrees of freedom
#> AIC: 57.905
#>
#> Number of Fisher Scoring iterations: 5
Which you can interpret as saying that while the proportion of empty stomachs didn't vary significantly between sites, there was a significantly higher proportion of fish with empty stomachs in 2017 compared to other years across all sites.
Reproducible data
df <- structure(list(Zone = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L), .Label = c("Crocodile", "Rankin", "West",
"Whipray"), class = "factor"), Year = c(2016, 2016, 2016, 2016,
2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018), total_count = c(1L,
14L, 29L, 9L, 18L, 46L, 67L, 7L, 7L, 69L, 58L, 15L), empty_count = c(0L,
3L, 7L, 4L, 8L, 17L, 31L, 4L, 2L, 8L, 17L, 0L)), row.names = c(NA,
-12L), class = "data.frame")
Created on 2022-02-09 by the reprex package (v2.0.1)