Median of grouped data

I have a dataset containing the number of infants born per gestational week.

I am trying to determine the median gestational age at delivery, based on the number of infants born in each gestational week for one particular year.

For example:

GA	num_infants_born
20 weeks	16
21 weeks	22
22 weeks	34
23 weeks	45
24 weeks	60
25 weeks	67
26 weeks	94

and so on, up to 41 weeks. The distribution is (not surprisingly) left-skewed.

I also calculated cumulative frequencies using

data$cumulative_freq = cumsum(data$num_infants_born)

Do I use the cumulative_freq column to find the gestational week at which the median infant is born? Using

median(data$cumulative_freq)

gives me an unexpected number.

I am expecting the median GA to be around 35 weeks, based on the distribution.

CodePudding user response：

If I understood your question correctly, you want to expand the table into one value per infant and take the median of that:

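# Your gestational data (GA must already be numeric for this to work):
gestational_data <- data.frame(GA_weeks = 20:26,
                               num_infants_born = c(16, 22, 34, 45, 60, 67, 94))

# Repeat each gestational week once per infant born in that week,
# flatten the result into a single vector, and take its median.
# See the apply() documentation by running ?apply
apply(gestational_data,
      1,                     # 1 = iterate over rows
      function(x) {
        rep(x[1], x[2])      # x[1] is GA_weeks, x[2] is num_infants_born
      }) |>
  unlist() |>
  median()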
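
The same expansion can probably be written without apply(), since rep() is vectorised over its times argument. This is a minimal sketch, assuming the weeks are stored as numbers in a column called GA_weeks as in the example above; the names expanded and median_ga are only illustrative.

```r
# Example data: the first seven rows from the question.
gestational_data <- data.frame(GA_weeks = 20:26,
                               num_infants_born = c(16, 22, 34, 45, 60, 67, 94))

# One entry per infant: each week is repeated by its frequency.
expanded <- rep(gestational_data$GA_weeks,
                times = gestational_data$num_infants_born)

# Ordinary median of the expanded vector.
median_ga <- median(expanded)
median_ga
```

This builds a vector with one element per infant, so it is fine for counts of this size but may be wasteful for very large frequencies.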

CodePudding user response：

What you want is a weighted median. First you need the weeks as numeric values; if you do not have them yet, you can extract them with gsub:

dat$GA_num <- as.numeric(gsub('\\D', '', dat$GA))

Then use weightedMedian from the matrixStats package, with the number of infants as weights.

matrixStats::weightedMedian(dat$GA_num, w=dat$num_infants_born)
# [1] 24.34646

Note that there are several definitions of the weighted median. For a comprehensive discussion, see this answer.

Data:

dat <- structure(list(GA = c("20 weeks", "21 weeks", "22 weeks", "23 weeks", 
"24 weeks", "25 weeks", "26 weeks"), num_infants_born = c(16L, 
22L, 34L, 45L, 60L, 67L, 94L)), class = "data.frame", row.names = c(NA, 
-7L))
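
To illustrate how the definitions can differ, here is a sketch of one non-interpolated variant in base R: take the first week whose cumulative frequency reaches half of the total. It reuses the cumsum() idea from the question. The names cum_freq, half_total and median_week are my own, and the data frame is rebuilt with a numeric GA_num column so the snippet runs on its own.

```r
# Example data with the weeks already numeric.
dat <- data.frame(GA_num = 20:26,
                  num_infants_born = c(16L, 22L, 34L, 45L, 60L, 67L, 94L))

# The cumulative sum only makes sense if the rows are ordered by week.
dat <- dat[order(dat$GA_num), ]

cum_freq   <- cumsum(dat$num_infants_born)   # running total of infants
half_total <- sum(dat$num_infants_born) / 2  # position of the middle infant

# First week at which the running total reaches the halfway point.
median_week <- dat$GA_num[which(cum_freq >= half_total)[1]]
median_week
```

On this subset the result is a whole week (24), whereas the interpolated value above falls between 24 and 25. As far as I know, weightedMedian also accepts interpolate = FALSE if you prefer a non-interpolated result from matrixStats; check ?weightedMedian for how ties are handled.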