I need to resample my survey. I have survey data from a country with 3 counties (1001, 1002, and 1003). The study surveyed 10 households in each county. I need to randomly pull the household id number of a smaller sample within each county.
The catch is that the percentage of household ids I need to pull varies from county to county. In other words, I need to extract a random sample of 25% of household ids in county 1001, 50% in county 1002, and 75% in county 1003.
Below is some mock data:
set.seed(100)
mock.data <- data.frame(county = rep(c(1001:1003), each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
Here are the proportions of household ids I need to pull from each county:
prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                             prop.households = c(0.25, 0.50, 0.75))
Below is the for loop command needed to extract household ids from mock.data with the household proportions from prop_to_sample.
household.ids.saved <- NULL
counties.run <- unique(mock.data$county)
for (i in counties.run) {
  ids <- mock.data %>%
    filter(county == **county**) %>%
    slice_sample(prop = **prop.households**) %>%
    ungroup() %>%
    pull(household.id)
  household.ids.saved <- c(household.ids.saved, ids)
}
Thank you
CodePudding user response:
Create a list of proportions, split your data.frame by group, then use map2() to sample by the specified proportion, returning only the household id.
library(dplyr)
library(purrr)
set.seed(100)
# Example data
mock.data <- data.frame(county = rep(c(1001:1003), each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
# List of proportions to sample for each group
props <- list(0.1, 0.5, 0.9)
# Split into a list of data.frames by group, sample specified proportions for
# the group and keep only the household ID
split(mock.data, ~ county) %>%
  map2(props,
       ~ .x %>%
         slice_sample(prop = .y) %>%
         pull(household.id))
#> $`1001`
#> [1] 1822
#>
#> $`1002`
#> [1] 1182 2330 2091 2816 1455
#>
#> $`1003`
#> [1] 2807 3675 1287 1970 1509 1604 1947 1346 2190
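The map2() call above pairs the proportions with the groups by position, so the list has to be in the same order as the split counties. A minimal sketch of a variant that looks each proportion up by county name instead, using the prop_to_sample table from the question, might look like this. The names props_by_county and sampled_ids are hypothetical and only used for illustration.

```r
library(dplyr)
library(purrr)

set.seed(100)
mock.data <- data.frame(county = rep(c(1001:1003), each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                             prop.households = c(0.25, 0.50, 0.75))

# Hypothetical helper: named vector of proportions, keyed by county
props_by_county <- setNames(prop_to_sample$prop.households,
                            prop_to_sample$county)

# split() names each piece by its county; imap() passes that name as the
# second argument, so the proportion is matched by name, not by position
sampled_ids <- split(mock.data, mock.data$county) %>%
  imap(function(df, county_name) {
    df %>%
      slice_sample(prop = props_by_county[[county_name]]) %>%
      pull(household.id)
  })

lengths(sampled_ids)
```

Because slice_sample() rounds the group size down, 25% and 75% of 10 households come out as 2 and 7 ids respectively.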
CodePudding user response:
You can do this in a number of ways. Here is a dplyr-based approach that uses group_map and slice_sample:
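library(dplyr)
# Look up the proportion for the current group's county, then sample that group
f <- function(x, y) {
  p <- with(prop_to_sample, prop.households[county == y$county])
  slice_sample(x, prop = p)
}
bind_rows(group_map(group_by(mock.data, county), f, .keep = TRUE))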
Output:
# A tibble: 14 x 2
   county household.id
    <int>        <int>
 1   1001         1502
 2   1001         2121
 3   1002         3346
 4   1002         2330
 5   1002         3371
 6   1002         3513
 7   1002         2527
 8   1003         3996
 9   1003         1346
10   1003         2807
11   1003         3675
12   1003         1509
13   1003         1970
14   1003         1604
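For completeness, the for loop from the question can probably be made to work as well by looking up the proportion for the current county inside the loop. This is only a sketch under that assumption; the variable p is a hypothetical name, and everything else follows the question's own objects.

```r
library(dplyr)

set.seed(100)
mock.data <- data.frame(county = rep(c(1001:1003), each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                             prop.households = c(0.25, 0.50, 0.75))

household.ids.saved <- NULL
counties.run <- unique(mock.data$county)

for (i in counties.run) {
  # Proportion that belongs to the county handled in this iteration
  p <- prop_to_sample$prop.households[prop_to_sample$county == i]

  ids <- mock.data %>%
    filter(county == i) %>%
    slice_sample(prop = p) %>%
    pull(household.id)

  household.ids.saved <- c(household.ids.saved, ids)
}

length(household.ids.saved)
```

Growing a vector inside a loop is fine for three counties; with many groups, the grouped approaches in this thread avoid the repeated copying.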
Here is a possible approach using data.table:
library(data.table)
setDT(mock.data)
setDT(prop_to_sample)
mock.data[, sample(household.id, size = .N*(prop_to_sample[county==.BY, prop.households])), county]
Output:
    county   V1
 1:   1001 2121
 2:   1001 3885
 3:   1002 2330
 4:   1002 3346
 5:   1002 3955
 6:   1002 2527
 7:   1002 2816
 8:   1003 2190
 9:   1003 1346
10:   1003 1604
11:   1003 1509
12:   1003 1947
13:   1003 1970
14:   1003 2807
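Note that .N * proportion is not always a whole number (for example 10 * 0.25), and sample() silently truncates the size. A sketch that makes the rounding explicit and attaches the proportions with an update join could look like the following. The names dt, props, prop and sampled are hypothetical, and sampling row positions rather than the ids themselves is my own choice, made to avoid the special behaviour of sample() when it is given a single number.

```r
library(data.table)

set.seed(100)
mock.data <- data.frame(county = rep(c(1001:1003), each = 10),
                        household.id = sample(1000:4000, 30, replace = FALSE))
prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                             prop.households = c(0.25, 0.50, 0.75))

# Work on copies so the original data.frames stay untouched
dt <- as.data.table(mock.data)
props <- as.data.table(prop_to_sample)

# Update join: add each county's proportion as a column of dt
dt[props, on = "county", prop := i.prop.households]

# Within each county, sample row positions; floor() makes the rounding explicit
sampled <- dt[, .(household.id = household.id[sample(.N, floor(.N * prop[1]))]),
              by = county]

sampled[, .N, by = county]
```

Swapping floor() for ceiling() or round() changes how fractional group sizes are handled, which matters when counties have few households.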
Here is another approach, which uses apply() over the rows of prop_to_sample:
rbindlist(
apply(prop_to_sample,1,\(r) setDT(mock.data)[county==r[1]][sample(.N, .N*r[2])])
)