R: resample survey units at different proportions

I need to resample my survey. I have survey data from a country with 3 counties (1001, 1002, and 1003). The study surveyed 10 households in each county. I need to randomly pull the household id number of a smaller sample within each county.

The catch is that the % of household ids I need to pull varies from county to county. In other words, I need to extract a random sample of 25% of household ids in county 1001, 50% in county 1002, and 75% from county 1003.

Below is some mock data.

set.seed(100)
mock.data <- data.frame(county= rep(c(1001:1003), each = 10),
                   household.id= sample(1000:4000, 30, replace=F))

Here is the proportion of household ids I need to pull from each county.

prop_to_sample <- data.frame(county=c(1001,1002,1003),
                             prop.households=c(0.25,0.50,0.75))

Below is my attempt at a for loop that extracts household ids from mock.data using the household proportions in prop_to_sample. The parts marked with ** are the placeholders I don't know how to fill in for each county.

household.ids.saved <- NULL
counties.run <- unique(mock.data$county)
for (i in counties.run) {
  ids <- mock.data %>%
    filter(county == **county**) %>%
    slice_sample(prop = **prop.households**) %>%
    ungroup() %>%
    pull(household.id)
  household.ids.saved <- c(household.ids.saved, ids)
}

Thank you

CodePudding user response:

Create a list of proportions, split your data.frame by group, then use map2() to sample by the specified proportion, returning only the household id.

library(dplyr)
library(purrr)
set.seed(100)

# Example data
mock.data <- data.frame(county = rep(c(1001:1003), each = 10),
                        household.id = sample(1000:4000, 30, replace = F))

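# List of proportions to sample for each group, matched by position to the
# groups returned by split() (sorted by county). These are example values;
# use list(0.25, 0.5, 0.75) for the proportions in the question.
props <- list(0.1, 0.5, 0.9)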

# Split into a list of data.frames by group, sample specified proportions for
# the group and keep only the household ID
split(mock.data, ~ county) %>%
  map2(props,
       ~ .x %>%
         slice_sample(prop = .y) %>%
         pull(household.id))
#> $`1001`
#> [1] 1822
#> 
#> $`1002`
#> [1] 1182 2330 2091 2816 1455
#> 
#> $`1003`
#> [1] 2807 3675 1287 1970 1509 1604 1947 1346 2190
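
Because map2() pairs groups and proportions by position, the list has to be kept in the same order as the counties. A minimal base R sketch that looks up each proportion by county code instead is shown below. It assumes the mock.data and prop_to_sample objects from the question; the names ids_by_county, props_by_county, and sampled are made up for illustration, and the sample size is rounded down with floor() as an explicit choice.

```r
set.seed(100)
mock.data <- data.frame(county = rep(1001:1003, each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                             prop.households = c(0.25, 0.50, 0.75))

# Household ids split by county; list names are the county codes as strings
ids_by_county <- split(mock.data$household.id, mock.data$county)

# Named lookup vector: county code (as string) -> proportion to sample
props_by_county <- setNames(prop_to_sample$prop.households,
                            prop_to_sample$county)

# For each county, look up its proportion by name and sample without replacement
sampled <- lapply(names(ids_by_county), function(cty) {
  ids <- ids_by_county[[cty]]
  n_take <- floor(length(ids) * props_by_county[[cty]])  # round down
  ids[sample.int(length(ids), n_take)]
})
names(sampled) <- names(ids_by_county)

# Number of ids drawn per county
lengths(sampled)
```

With 10 households per county this draws 2, 5, and 7 ids, whatever order the rows of prop_to_sample are in.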

CodePudding user response:

You can do this in a number of ways.

Here is a dplyr-based approach that uses group_map() and slice_sample():

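library(dplyr)

# x: the rows of one county; y: a one-row tibble with that group's key (county)
f <- function(x, y) {
  # Look up the sampling proportion for this county
  p <- with(prop_to_sample, prop.households[county == y$county])
  slice_sample(x, prop = p)
}
# .keep = TRUE keeps the county column in each group's data
bind_rows(group_map(group_by(mock.data, county), f, .keep = TRUE))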

Output:

# A tibble: 14 x 2
   county household.id
    <int>        <int>
 1   1001         1502
 2   1001         2121
 3   1002         3346
 4   1002         2330
 5   1002         3371
 6   1002         3513
 7   1002         2527
 8   1003         3996
 9   1003         1346
10   1003         2807
11   1003         3675
12   1003         1509
13   1003         1970
14   1003         1604
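
In this output county 1001 contributes 2 rows and county 1003 contributes 7, since 25% and 75% of 10 households are 2.5 and 7.5 and the fractional part is dropped. If a different rounding rule is wanted, the sample size can be computed explicitly and passed as a count rather than a proportion. The sketch below only illustrates the arithmetic; n_households and props are illustrative names, not part of the code above.

```r
n_households <- 10
props <- c(0.25, 0.50, 0.75)

# Round down: what the output above corresponds to
n_floor <- floor(n_households * props)

# Round up: never sample fewer than the requested share
n_ceiling <- ceiling(n_households * props)

# round() in R sends exact halves to the even neighbour,
# so 2.5 becomes 2 while 7.5 becomes 8
n_round <- round(n_households * props)

rbind(n_floor, n_ceiling, n_round)
```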

Here is a possible approach using data.table:

library(data.table)
setDT(mock.data)
setDT(prop_to_sample)

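# Grouped by county: .N is the number of rows in the current group and .BY
# holds its county value, which is used to look up the matching proportion
mock.data[, sample(household.id, size = .N * prop_to_sample[county == .BY, prop.households]), county]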

Output:

    county   V1
 1:   1001 2121
 2:   1001 3885
 3:   1002 2330
 4:   1002 3346
 5:   1002 3955
 6:   1002 2527
 7:   1002 2816
 8:   1003 2190
 9:   1003 1346
10:   1003 1604
11:   1003 1509
12:   1003 1947
13:   1003 1970
14:   1003 2807

Here is another approach, which uses apply() over the rows of prop_to_sample:

rbindlist(
  apply(prop_to_sample,1,\(r) setDT(mock.data)[county==r[1]][sample(.N, .N*r[2])])
)
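
The for loop from the question can also be made to work by filtering on the loop variable and looking up the matching proportion inside the loop. This is only a sketch of one way to fill in the two placeholders, assuming the mock.data and prop_to_sample objects from the question; p is a helper name introduced here.

```r
library(dplyr)

set.seed(100)
mock.data <- data.frame(county = rep(1001:1003, each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                             prop.households = c(0.25, 0.50, 0.75))

household.ids.saved <- NULL
for (i in unique(mock.data$county)) {
  # Proportion to sample for the current county
  p <- prop_to_sample$prop.households[prop_to_sample$county == i]

  ids <- mock.data %>%
    filter(county == i) %>%
    slice_sample(prop = p) %>%
    pull(household.id)

  household.ids.saved <- c(household.ids.saved, ids)
}

# Total number of household ids drawn across all counties
length(household.ids.saved)
```

Growing a vector inside a loop is fine at this size; for many groups the grouped approaches above avoid the repeated copying.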