Re sampling a data to match the population profile using two demographic variables ( sex and age) (u-CodePudding

I'm struggling with what I imagine is a multi-level sampling procedure in R. Let's say I have a dataset composed of a very biased sampling method. Therefore, the results obtained with the participants are biased. I would like to adjust the dataset to match two demographic variables (sex and age), which are coded as factor in the dataset. The following image described the situation.

I assume that I'll need to perform a "loop" calculation. As an example: to adjust the sample size of the first age interval (15-19), I'll need to define a new total in which this final sample fits the 50% 50% definition. The same procedure will be needed for all other age intervals.

That's the most related topic I've found.

x<-structure(list(age_cat = c("25-29", "30-34", "25-29", "20-24", 
                              "25-29", "20-24", "35-39", "30-34", "25-29", "30-34", "25-29", 
                              "30-34", "35-39", "45-49", "40-45", "20-24", "20-24", "25-29", 
                              "35-39", "35-39", "25-29", "20-24", "30-34", "30-34", "40-45", 
                              "25-29", "25-29", "25-29", "20-24", "40-45", "20-24", "40-45", 
                              "30-34", "25-29", "45-49", "30-34", "45-49", "40-45", "25-29", 
                              "35-39", "40-45", "25-29", "45-49", "35-39", "45-49", "40-45", 
                              "20-24", "45-49", "40-45", "25-29", "35-39", "30-34", "30-34", 
                              "25-29", "20-24", "20-24", "40-45", "35-39", "25-29", "25-29", 
                              "20-24", "40-45", "20-24", "20-24", "45-49", "20-24", "35-39", 
                              "20-24", "35-39", "45-49", "15-19", "45-49", "35-39", "35-39", 
                              "30-34", "35-39", "45-49", "35-39", "30-34", "20-24", "35-39", 
                              "40-45", "40-45", "40-45", "30-34", "45-49", "20-24", "30-34", 
                              "45-49", "35-39", "20-24", "20-24", "20-24", "45-49", "20-24", 
                              "45-49", "35-39", "25-29", "40-45", "40-45", "25-29", "35-39", 
                              "45-49", "30-34", "45-49", "45-49", "45-49", "15-19", "30-34", 
                              "45-49", "30-34", "30-34", "35-39", "25-29", "40-45", "15-19", 
                              "20-24", "20-24", "40-45", "40-45", "45-49", "45-49", "35-39", 
                              "40-45", "30-34", "35-39", "35-39", "25-29", "25-29", "20-24", 
                              "20-24", "40-45", "20-24", "35-39", "20-24", "20-24", "30-34", 
                              "25-29", "45-49", "25-29", "35-39", "20-24", "35-39", "35-39", 
                              "35-39", "40-45", "35-39", "35-39", "20-24", "30-34", "25-29", 
                              "15-19", "30-34", "35-39", "15-19", "20-24", "20-24", "35-39", 
                              "25-29", "25-29", "25-29", "25-29", "30-34", "40-45", "35-39", 
                              "30-34", "35-39", "40-45", "25-29", "30-34", "25-29", "25-29", 
                              "45-49", "30-34", "30-34", "25-29", "15-19", "25-29", "20-24", 
                              "15-19", "20-24", "30-34", "20-24", "40-45", "25-29", "25-29", 
                              "30-34", "30-34", "25-29", "20-24", "40-45", "45-49", "25-29", 
                              "25-29", "40-45", "35-39", "25-29", "45-49", "35-39", "30-34", 
                              "45-49", "30-34", "30-34", "45-49", "35-39", "20-24", "45-49", 
                              "30-34", "25-29", "45-49", "45-49", "40-45", "25-29", "20-24", 
                              "40-45", "30-34", "35-39", "30-34", "20-24", "35-39", "20-24", 
                              "30-34", "20-24", "35-39", "35-39", "30-34", "45-49", "40-45", 
                              "45-49", "25-29", "35-39", "40-45", "30-34", "35-39", "30-34", 
                              "35-39", "20-24", "25-29", "35-39", "30-34", "30-34", "25-29", 
                              "45-49", "45-49", "40-45", "40-45", "35-39", "30-34", "25-29", 
                              "35-39", "20-24", "40-45", "20-24", "30-34", "40-45", "20-24", 
                              "45-49", "20-24", "40-45", "25-29", "40-45", "25-29", "45-49", 
                              "30-34", "30-34", "45-49", "40-45", "30-34", "30-34", "20-24", 
                              "20-24", "35-39", "30-34", "15-19", "35-39", "25-29", "45-49", 
                              "30-34", "25-29", "35-39", "15-19", "40-45", "45-49", "15-19", 
                              "35-39", "45-49", "45-49", "25-29"), sex_cat = structure(c(1L, 
                                                                                         2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 
                                                                                         1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 
                                                                                         2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 
                                                                                         2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 
                                                                                         1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
                                                                                         1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 
                                                                                         1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 
                                                                                         1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 
                                                                                         1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 
                                                                                         1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 
                                                                                         2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 
                                                                                         1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 
                                                                                         1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 
                                                                                         2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 
                                                                                         1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
                                                                                         1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 
                                                                                         1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
                                                                                         2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M", 
                                                                                                                                                                 "F"), class = "factor")), row.names = c(NA, -288L), class = c("tbl_df", 
                                                                                                                                                                                                                               "tbl", "data.frame"))

CodePudding user response：

Okay so this was a bit of a doozie! Here is what I did:

library(tidyverse)
library(data.table)
library(splitstackshape)

x <- x %>% mutate(id = row_number(),
                  sex_cats = paste("N", sex_cat, sep = "_"))

x_dt <- data.table(x)
x_cts <- x %>% group_by(age_cat, sex_cat) %>% summarise(n = n()) %>% ungroup(sex_cat)
x_raw <- data.frame(age_cat  = rep(unique(x_cts$age_cat), each = 2), 
                    sex_cat  = rep(unique(x_cts$sex_cat), times = length(unique(x_cts$age_cat))),
                    percents = c(0.5, 0.5, 0.8, 0.2, 0.34, 0.66, 0.5, 0.5, 0.75, 0.25, 0.5, 0.5, 0.6, 0.4)

x_raw_wd <- x_raw %>% pivot_wider(names_from = sex_cat, values_from = percents, names_prefix = "per_")

x_raw_wd <- x_raw_wd %>% mutate(N_M = round(per_M * total_n),
                                N_F = round(per_F * total_n))

x_raw_wd$total_n <- c(6, 30, 30, 30, 20, 10, 20)

x_raw_wd_fin <- x_raw_wd %>% 
                select(age_cat, N_M, N_F) %>% 
                pivot_longer(cols = starts_with("N_"), names_to = "sex_cats") %>% 
                arrange(age_cat, sex_cats) 

x_raw_wd_dt <- data.table(x_raw_wd_fin)

stratified(x_dt[, KEY := paste(age_cat, sex_cats)], "KEY", keep.rownames = T, 
           with(x_raw_wd_dt, setNames(value, paste(age_cat, sex_cats))))

There are people better than me at using data.table but what I did here, was first create an id column and sex_cats. sex_cats is used later but keep this here for now. x_cts was created to check and make sure the data you sent was copied and pasted correctly.

Then I create x_raw which is a simulated version of the request; here we include for each age_cat and sex_cat a percents for each sex_cat within each age_cat. These have to add up to 100%.

Then I pivot_wider to get the percents into wide format across each sex_cat. Then I simulate the number of samples you want from each age_cat: this is manually inserted so if you need to change the number for each age_cat, feel free to. From here we calculate for each sex_cat the total number of samples in x_raw_wd.

Then we get this in long format because of the requirements for the function stratified from splitstackshape. If you look at the names_to option, this is shifted to N_M or N_F, which is different than sex_cat (sex_cat = 'M', 'F'). That's why in the beginning we created sex_cats.

Finally, we submit everything into stratified. We create a KEY column to link our x_raw_wd_fin$value, which is total number of samples required by age_cat and sex_cat, to the combination of age_cat and sex_cat for each observation in x.

Based on my percentages, mostly made-up for demonstration purposes, I need 146 samples.

Here is my output:

     age_cat sex_cat  id paste("N", sex_cat)       KEY sex_cats
  1:   15-19       F 281                 N F 15-19 N_F      N_F
  2:   15-19       F 155                 N F 15-19 N_F      N_F
  3:   15-19       F 177                 N F 15-19 N_F      N_F
  4:   15-19       M 108                 N M 15-19 N_M      N_M
  5:   15-19       M 284                 N M 15-19 N_M      N_M
 ---                                                           
142:   45-49       M 105                 N M 45-49 N_M      N_M
143:   45-49       M  37                 N M 45-49 N_M      N_M
144:   45-49       M 207                 N M 45-49 N_M      N_M
145:   45-49       M 173                 N M 45-49 N_M      N_M
146:   45-49       M 103                 N M 45-49 N_M      N_M