Median of grouped data

Time:12-07

I have a dataset containing the number of infants born per gestational week.

I am trying to determine the median gestational age (GA) at delivery from the number of infants born in each gestational week of a particular year.

For example:

GA num_infants_born
20 weeks 16
21 weeks 22
22 weeks 34
23 weeks 45
24 weeks 60
25 weeks 67
26 weeks 94

and so on, up to 41 weeks. The distribution is (not surprisingly) left-skewed.

I also calculated cumulative frequencies using

data$cumulative_freq = cumsum(data$num_infants_born) 

Should I use the cumulative_freq column to find the gestational week that corresponds to the median? Using

median(medianGA2001a$cumulative_freq)

gives me an unexpected number.

I am expecting the median GA to be around 35 weeks, based on the distribution.

CodePudding user response:

If I understood your question correctly, you want to repeat each gestational week once per infant born and take the median of the expanded values, something like this:

# Your gestational data:
gestational_data <- data.frame(GA_weeks = 20:26,
                               num_infants_born = c(16, 22, 34, 45, 60, 67, 94))

# For each row, repeat the week (x[1]) once per infant born (x[2]),
# then flatten the resulting list and take the median.
# See the apply() documentation by running ?apply
apply(gestational_data,
      1,
      function(x) rep(x[1], x[2])) |>
  unlist() |>
  median()
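
As a side note, the same expansion can probably be written without apply(), because rep() accepts a vector of counts in its times argument. The following is only a sketch on the seven sample rows from the question; the names expanded_ga and median_ga are made up for illustration.

```r
# Sample rows from the question (weeks 20 to 26 only)
gestational_data <- data.frame(GA_weeks = 20:26,
                               num_infants_born = c(16, 22, 34, 45, 60, 67, 94))

# Repeat each week once per infant born in that week
expanded_ga <- rep(gestational_data$GA_weeks,
                   times = gestational_data$num_infants_born)

# Ordinary median of the expanded vector
median_ga <- median(expanded_ga)
median_ga
# 24 for these seven sample rows
```

With the full table up to 41 weeks, the result should move towards the value expected from the whole distribution.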

CodePudding user response:

What you want is a weighted median. First convert the weeks to numeric, if they are not already, by stripping the non-digit characters with gsub:

dat$GA_num <- as.numeric(gsub('\\D', '', dat$GA))

Then, use weightedMedian from the matrixStats package with the number of infants as weights.

matrixStats::weightedMedian(dat$GA_num, w=dat$num_infants_born)
# [1] 24.34646

Note that there are several definitions of the weighted median, so other methods may return a slightly different value. For a comprehensive discussion, see this answer.


Data:

dat <- structure(list(GA = c("20 weeks", "21 weeks", "22 weeks", "23 weeks", 
"24 weeks", "25 weeks", "26 weeks"), num_infants_born = c(16L, 
22L, 34L, 45L, 60L, 67L, 94L)), class = "data.frame", row.names = c(NA, 
-7L))
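
To connect this back to the cumulative_freq column from the question: taking median() of that column gives the median of the running totals, not of the gestational ages. One way to use the column is to look for the first week whose cumulative count reaches half of the total. This is a minimal sketch on the sample rows only; half_total and median_week are illustrative names, and it assumes the rows are sorted by week.

```r
# Sample rows from the question, with weeks already numeric
dat <- data.frame(GA_num = 20:26,
                  num_infants_born = c(16L, 22L, 34L, 45L, 60L, 67L, 94L))

# Running total of infants born up to and including each week
dat$cumulative_freq <- cumsum(dat$num_infants_born)

# Half of all births; the median week is the first one that reaches it
half_total <- sum(dat$num_infants_born) / 2
median_week <- dat$GA_num[which(dat$cumulative_freq >= half_total)[1]]
median_week
# 24 for these seven sample rows
```

This is a simplification: if a cumulative count lands exactly on half of the total, the median falls between that week and the next one, and this sketch returns the lower week without interpolating.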