I have to compare the distribution of "Pct" (a variable in the data frame) for male and female cats (both are recorded in the same variable, "Sex").
The problem is that there are different numbers of observations for male and female cats:
kat %>% group_by(Sex) %>%
summarize(count = n())
# A tibble: 2 × 2
  Sex   count
  <chr> <int>
1 F        47
2 M        97
Just for some more information:
head(kat)
  Sex Bwt Hwt       Pct
1   F 2.0 7.0 0.3500000
2   F 2.0 7.4 0.3700000
3   F 2.0 9.5 0.4750000
4   F 2.1 7.2 0.3428571
5   F 2.1 7.3 0.3476190
6   F 2.1 7.6 0.3619048
As far as I know, to make a QQ-plot the two groups must have the same number of observations.
What do I do?
I have searched both here and on Google, but I can't find any relevant information and keep running into dead ends.
Please let me know if more information is needed to solve this.
CodePudding user response:
You could do something like the below: calculate a set of quantile values for each group at the same probabilities, then plot them against each other:
library(tidyr)
library(dplyr)
library(ggplot2)
dat <- data.frame(sex=rep(c("male", "female"), c(47, 67)),
                  pct = rnorm(114, 50, 20))
qdat <- dat %>%
  group_by(sex) %>%
  summarise(data.frame(
    pctile = seq(.05, .95, by=.05),
    q = quantile(pct, seq(.05, .95, by=.05)))) %>%
  unnest(q) %>%
  pivot_wider(names_from = "sex", values_from = "q")
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
ggplot(qdat, aes(x=male, y=female)) +
  geom_point() +
  geom_abline(intercept=0, slope=1, linetype=2) +
  theme_classic()
Created on 2022-11-24 by the reprex package (v2.0.1)
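In the plot above, the point in the lower-left corner is the 5th percentile for males plotted against the 5th percentile for females. Because the quantiles are computed at the same probabilities in each group, the two groups do not need the same number of observations. If they come from the same distribution, the points should lie pretty close to the dashed 45-degree line.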
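If you would rather stay in base R, `qqplot()` from the stats package also accepts two samples of different lengths: it sorts both and interpolates the longer one down to the length of the shorter one. Here is a minimal sketch of that idea. The data frame `kat_sim` and its values are made up for illustration (only the group sizes of 47 and 97 come from the question), so swap in your own `kat$Pct` and `kat$Sex`.

```r
# Simulated stand-in for the question's data: 47 females, 97 males.
# The Pct values are invented for illustration only.
set.seed(1)
kat_sim <- data.frame(
  Sex = rep(c("F", "M"), c(47, 97)),
  Pct = c(rnorm(47, 0.39, 0.05), rnorm(97, 0.39, 0.05))
)

female <- kat_sim$Pct[kat_sim$Sex == "F"]
male   <- kat_sim$Pct[kat_sim$Sex == "M"]

# qqplot() sorts both samples; when the lengths differ, the longer sample
# is interpolated so that both vectors end up with the shorter length.
qq <- qqplot(male, female, plot.it = FALSE)

# Plot the matched quantiles with a 45-degree reference line.
plot(qq$x, qq$y,
     xlab = "Male Pct quantiles", ylab = "Female Pct quantiles")
abline(0, 1, lty = 2)
```

Calling `qqplot(male, female)` without `plot.it = FALSE` draws the plot directly; the two-step version is used here only so the matched quantiles in `qq$x` and `qq$y` can be inspected.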