Is there a way to draw a sample with a fixed percentage from a specific categorical variable?

Say I have a population of 1000 patients with data on their sex. I've been asked to draw a sample of size n in which exactly 65% of the patients are male.

Some sample data (here the sex distribution is 50%-50%):

data <- data.frame(patient_id = 1:1000,
                   sex = append(rep("male", 500),
                                rep("female", 500)))

I can't see a way to do this with sample_n or sample_frac in dplyr.

The result should look something like this for n = 500, but with random patient_ids:

data.frame(patient_id = 1:500,
           sex = append(rep("male", 325),
                        rep("female", 175))
           )

Any insight is appreciated.

CodePudding user response：

We can sample each sex separately and combine the results with bind_rows. First, set the sample size and the percentage as variables so that they are easy to change:

library(tidyverse)

number_of_sample <- 500

male_pct <- 0.65

# round() keeps the count a whole number when the product is not an integer
number_of_male <- round(number_of_sample * male_pct)

number_of_female <- number_of_sample - number_of_male

# Set the seed for reproducibility
set.seed(4)

sampled_data <- data %>%
  filter(sex == 'male') %>%
  sample_n(size = number_of_male) %>%
  bind_rows(data %>%
              filter(sex == 'female') %>%
              sample_n(size = number_of_female))

Checking the numbers:

sampled_data %>%
  group_by(sex) %>%
  summarise(count=n())

# A tibble: 2 x 2
  sex    count
  <chr>  <int>
1 female   175
2 male     325
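
The same idea can be wrapped in a small function. This is only a sketch: the helper name sample_by_sex is made up for illustration, and it uses slice_sample (dplyr 1.0.0 or later) because the dplyr documentation marks sample_n as superseded. Since slice_sample quietly returns fewer rows when a group is too small, the sketch checks the group sizes first.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): draws n rows from df so
# that round(n * male_pct) of them are male and the remainder are female.
sample_by_sex <- function(df, n, male_pct) {
  n_male <- round(n * male_pct)
  n_female <- n - n_male
  # Fail early if either sex has too few rows to sample without replacement
  stopifnot(n_male <= sum(df$sex == "male"),
            n_female <= sum(df$sex == "female"))
  bind_rows(
    slice_sample(filter(df, sex == "male"), n = n_male),
    slice_sample(filter(df, sex == "female"), n = n_female)
  )
}

data <- data.frame(patient_id = 1:1000,
                   sex = rep(c("male", "female"), each = 500))

set.seed(4)
sampled <- sample_by_sex(data, n = 500, male_pct = 0.65)
count(sampled, sex)
```

Computing the female count as the remainder keeps the total at exactly n even when n * male_pct is not a whole number.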

CodePudding user response：

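Another tidyverse option: shuffle the rows once, then take the first rows of each sex.

library(dplyr)

n <- 150

# Shuffle whole rows so that each patient_id stays paired with its own sex
df <- slice_sample(data, prop = 1)

view <- filter(df, sex == 'male')[1:round(n*0.65),] %>%
  bind_rows(filter(df, sex == 'female')[1:round(n*0.35),])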

Counting the rows gives us:

count(view, sex)

#      sex  n
# 1 female 52
# 2   male 98
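
If dplyr is not available, a rough base R equivalent is to sample row indices within each sex. This is a sketch under the question's setup; the variable names n_male, n_female, male_rows and female_rows are my own.

```r
data <- data.frame(patient_id = 1:1000,
                   sex = rep(c("male", "female"), each = 500))

n <- 150
n_male <- round(n * 0.65)
n_female <- n - n_male   # remainder, so the two counts always sum to n

set.seed(1)
# sample() picks row indices within each sex; whole rows are kept intact
male_rows <- sample(which(data$sex == "male"), n_male)
female_rows <- sample(which(data$sex == "female"), n_female)
sampled <- data[c(male_rows, female_rows), ]

table(sampled$sex)
```

One caveat: sample(x, size) treats a length-one x as 1:x, so this shortcut is only safe when each sex has more than one row.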

CodePudding user response：

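This is an alternative solution that nests the data and samples within a single pipeline. Note that sample_frac takes a fraction of each group, so 0.65 and 0.35 yield a 65/35 sample only because both groups have the same size (500 each); the proportions would need to be changed if the population isn't a 50/50 split. The prop values are also matched to the nested groups by position, so check the row order after nest().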

library(tidyverse)
sampled_data <- data %>%
  group_by(sex) %>%
  nest() %>%
  ungroup() %>%
  mutate(prop = c(0.65, 0.35)) %>%
  mutate(samples = map2(data, prop, sample_frac)) %>%
  select(-data, -prop) %>%
  unnest(samples)

sampled_data %>% count(sex)

# A tibble: 2 × 2
  sex        n
  <fct>  <int>
1 female   175
2 male     325
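
For a population that is not split 50/50, one possible variation is to join target counts by group name and sample that many rows per group. This is a sketch, not part of the answer above: the 700/300 population, the targets table and the n_target column are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Unbalanced example population (assumed for illustration): 700 male, 300 female
data <- data.frame(patient_id = 1:1000,
                   sex = rep(c("male", "female"), times = c(700, 300)),
                   stringsAsFactors = FALSE)

# Target counts keyed by group name, so they do not depend on row order
targets <- tibble(sex = c("male", "female"), n_target = c(325, 175))

set.seed(4)
sampled_data <- data %>%
  nest(data = -sex) %>%                 # one row per sex, rows stored in `data`
  left_join(targets, by = "sex") %>%    # attach the target count for each sex
  mutate(samples = map2(data, n_target,
                        function(d, k) slice_sample(d, n = k))) %>%
  select(sex, samples) %>%
  unnest(samples)

count(sampled_data, sex)
```

Joining on sex rather than assigning a positional vector means the counts stay attached to the right group even if the nested rows come out in a different order.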