Sample a proportion of each group, BUT with a minimum constraint (using dplyr)

Time:06-03

I have a population with 6 categories (strata), and I want to sample 10% of each stratum. To do this, I run:

library(dplyr)

var <- c(rep("A", 10), rep("B", 10), rep("C", 3), rep("D", 5), "E", "F")
value <- rnorm(30)
dat <- tibble(var, value)

# Group by stratum, then sample 10% of each group
pop <- dat %>% group_by(var)
pop
singleallocperce <- slice_sample(pop, prop = 0.1)
singleallocperce

with result:

# A tibble: 2 x 2
# Groups:   var [2]
  var   value
  <chr> <dbl>
1 A     -1.54
2 B     -1.12

But when a stratum is too small for the 10% sample to yield any rows, I still want to take at least one observation from it. How can I do this in R using the dplyr package?

CodePudding user response:

One possible approach (note that the data below contain 20 A rows, to check that two of them are returned).

library(tidyverse)

# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)

# Possible approach
dat %>%
  group_by(var) %>%
  mutate(min = if_else(n() * 0.1 >= 1, n() * 0.1, 1),
         random = sample(n())) %>%
  filter(random <= min) %>%
  select(var, value)
#> # A tibble: 7 × 2
#> # Groups:   var [6]
#>   var     value
#>   <chr>   <dbl>
#> 1 A      0.0105
#> 2 A      0.171 
#> 3 B     -1.89  
#> 4 C      1.89  
#> 5 D      0.612 
#> 6 E      0.516 
#> 7 F      0.185

Created on 2022-06-02 by the reprex package (v2.0.1)
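
The same idea can also be sketched with an explicit per-group sample size: 10% of the group rounded down, but never less than one row. This is only a minimal sketch, not the answer's own code. The helper name sample_prop_min and its arguments prop and min_n are made up for illustration, and the seed is fixed only so the example is reproducible.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only for reproducibility

dat <- tibble(
  var   = c(rep("A", 20), rep("B", 10), rep("C", 3), rep("D", 5), "E", "F"),
  value = rnorm(40)
)

# Hypothetical helper: sample `prop` of each group, but at least `min_n` rows
sample_prop_min <- function(data, group, prop = 0.1, min_n = 1) {
  data %>%
    group_by({{ group }}) %>%
    # .x is the data for one group; its size decides how many rows to draw
    group_modify(~ slice_sample(.x, n = max(min_n, floor(prop * nrow(.x))))) %>%
    ungroup()
}

out <- sample_prop_min(dat, var)
count(out, var)
```

With this data, A contributes two rows and every other stratum contributes one, whatever the seed.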

CodePudding user response:

Here is a potential solution:

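library(dplyr)

sample_func <- function(data) {
  # Proportional 10% sample; groups with fewer than 10 rows return nothing
  standard <- data %>% 
    group_by(var) %>% 
    slice_sample(prop = 0.1) %>% 
    ungroup()
  
  # For every group missing from the sample, fall back to its first row.
  # This is computed unconditionally, so `mins` always exists (it is simply
  # empty when no group is missing) and bind_rows() cannot fail on it.
  mins <- data %>% 
    filter(!var %in% standard$var) %>% 
    group_by(var) %>% 
    slice(1) %>% 
    ungroup()
  
  bind_rows(standard, mins) 
}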

sample_func(dat)

Which gives:

  var     value
  <chr>   <dbl>
1 A      1.36  
2 B     -1.03  
3 C     -0.0450
4 D     -0.380 
5 E     -0.0556
6 F      0.519 

This assumes that when proportional sampling returns no rows for a value of var, the minimum should be one record from that group. Here slice(1) takes the group's first row.
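
If the fallback row should itself be drawn at random rather than always being the first row of the group, a small variation might look like the sketch below. The function name sample_func_random and the prop argument are hypothetical; the structure follows the function above and swaps slice(1) for slice_sample(n = 1).

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only for reproducibility

dat <- tibble(
  var   = c(rep("A", 20), rep("B", 10), rep("C", 3), rep("D", 5), "E", "F"),
  value = rnorm(40)
)

# Hypothetical variant: the fallback row is sampled, not taken from the top
sample_func_random <- function(data, prop = 0.1) {
  # Proportional sample; small groups may contribute no rows here
  standard <- data %>%
    group_by(var) %>%
    slice_sample(prop = prop) %>%
    ungroup()

  # One random row from each group missing from the proportional sample
  fallback <- data %>%
    filter(!var %in% standard$var) %>%
    group_by(var) %>%
    slice_sample(n = 1) %>%
    ungroup()

  bind_rows(standard, fallback)
}

res <- sample_func_random(dat)
count(res, var)
```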

CodePudding user response:

data.table

library(data.table)

setDT(dat) # make the tibble a data.table

dat[, .SD[sample((1:.N), fifelse(.N >= 10, .N %/% 10, 1))], var]

results

   var     value
1:   A -0.040487
2:   A  0.543354
3:   B -1.100892
4:   C  0.998006
5:   D  0.496898
6:   E  0.819967
7:   F  0.629236

data

# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)
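
As an alternative sketch of the data.table approach, the row indices can be sampled with .I and used to subset the table once, instead of subsetting .SD per group. This is an illustration, not the answer's code: the names dt and idx are made up here, and the seed is fixed only for reproducibility.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only for reproducibility

dt <- data.table(
  var   = c(rep("A", 20), rep("B", 10), rep("C", 3), rep("D", 5), "E", "F"),
  value = rnorm(40)
)

# Row indices to keep: 10% of each group (rounded down), at least one row.
# sample(.N, k) draws k positions from 1..N, so it stays safe when .N == 1;
# .I maps those within-group positions back to row numbers in dt.
idx <- dt[, .I[sample(.N, max(1L, .N %/% 10L))], by = var]$V1

res_dt <- dt[idx]
res_dt
```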