I have a dataset containing the number of infants born per gestational week.
I am trying to determine the median gestational age at delivery from the number of infants born in each week of this particular year.
For example:
GA | num_infants_born |
---|---|
20 weeks | 16 |
21 weeks | 22 |
22 weeks | 34 |
23 weeks | 45 |
24 weeks | 60 |
25 weeks | 67 |
26 weeks | 94 |
and so on, up to 41 weeks. The distribution is (not surprisingly) left-skewed.
I also calculated cumulative frequencies using
data$cumulative_freq = cumsum(data$num_infants_born)
Should I use the cumulative_freq column to find the gestational week that corresponds to the median birth? Using
median(medianGA2001a$cumulative_freq)
gives me an unexpected number.
I am expecting the median GA to be around 35 weeks, based on the distribution.
CodePudding user response:
If I understood your question correctly you're going to want to do something like this:
# Your gestational data:
gestational_data <- data.frame(GA_weeks = 20:26,
                               num_infants_born = c(16, 22, 34, 45, 60, 67, 94))
# Repeat each week once per infant born in it, then take the median
# of the expanded vector. See the apply() documentation by running
# ?apply
apply(gestational_data,
      1,
      function(x) {
        rep(x[1], x[2])
      }) |>
  unlist() |>
  median()
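The same expansion can probably be written without apply(), because rep() accepts a vector of repeat counts through its times argument. This is only a sketch on the seven example rows from the question; the names ga_weeks, counts and expanded are made up for illustration.

```r
# Example rows from the question (weeks 20 to 26 only).
ga_weeks <- 20:26
counts   <- c(16, 22, 34, 45, 60, 67, 94)

# One entry per infant: week 20 repeated 16 times, week 21 repeated 22 times, ...
expanded <- rep(ga_weeks, times = counts)

# Median gestational age over all infants in the example rows.
median_ga <- median(expanded)
median_ga
```

On the full 20 to 41 week table the same two calls should apply unchanged, as long as the counts are whole numbers.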
CodePudding user response:
What you want is a weighted median. First you need the weeks as numeric, which you can get using gsub if they are not yet available:
dat$GA_num <- as.numeric(gsub('\\D', '', dat$GA))
Then use weightedMedian from the matrixStats package, with the number of infants as weights.
matrixStats::weightedMedian(dat$GA_num, w=dat$num_infants_born)
# [1] 24.34646
Note that there are several definitions of the weighted median. For a comprehensive discussion, see this answer.
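One of those definitions can be read straight off the cumulative frequencies from the question: take the first week whose running total reaches half of all births. Calling median() on the cumulative column itself returns the median of the running totals, not a gestational age, which is likely why it gave an unexpected number. The sketch below uses base R only; births, cum_freq, half_total and median_week are illustrative names, and the rule shown ignores the tie case where a running total lands exactly on the halfway point.

```r
# Example rows from the question (weeks 20 to 26 only).
births <- data.frame(GA_num = 20:26,
                     num_infants_born = c(16, 22, 34, 45, 60, 67, 94))

# Running total of births, ordered by gestational week.
cum_freq <- cumsum(births$num_infants_born)

# Halfway point of all births in the example rows.
half_total <- sum(births$num_infants_born) / 2

# First week whose running total reaches the halfway point.
median_week <- births$GA_num[which(cum_freq >= half_total)[1]]
median_week
```

This gives a whole week rather than an interpolated value, so it will generally differ slightly from the weightedMedian result above.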
Data:
dat <- structure(list(GA = c("20 weeks", "21 weeks", "22 weeks", "23 weeks",
"24 weeks", "25 weeks", "26 weeks"), num_infants_born = c(16L,
22L, 34L, 45L, 60L, 67L, 94L)), class = "data.frame", row.names = c(NA,
-7L))