In this post on sampling a proportion with a lower bound on the number of rows sampled, I wrote a function (see below) that takes a data.frame containing some group identifier(s), splits the data.frame by group into a list, and then samples the greater of a proportion of each group's rows and a minimum number of rows.
While this works, I was wondering whether there is an efficient way to do this with summarise(), or otherwise without splitting the output of group_by() into a list and then iterating across the elements of the list with map()/lapply()-like functions. The idea would be to pass the data to group_by() and then to summarise(), where I would count the number of rows in each group and then sample either the proportion or the minimum number accordingly, using an if_else()-style approach. However, I found that this produced various scoping issues or type conflicts. For example, cur_group() and cur_data() seem useful for counting and subsetting within the same summarise() call, but I'm not sure how to use them properly.
Does anyone have an idea for how to do this within summarise(), or otherwise avoid split()-ing the data outside of summarise()?
library(dplyr)
# Example data: 10 rows in group a, 100 in group b
df <- data.frame(x = 1:110,
                 y = rnorm(110),
                 group = c(rep("a", 10), rep("b", 100)))
# Proportion and minimum number of rows to sample
sample_prop <- 0.5
sample_min <- 8
# Group the data and split each group into a list of tibbles
df_list <- df %>% group_by(group) %>% group_split()
# Checks if the number of rows that would be sampled is below the minimum. If so,
# sample the minimum number of rows, otherwise sample the proportion. This is
# what I'm trying to do within a summarise call.
conditional_sample <- function(dat, sample_min, sample_prop) {
  if (nrow(dat) * sample_prop < sample_min) {
    slice_sample(dat, n = sample_min)
  } else {
    slice_sample(dat, prop = sample_prop)
  }
}
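# For illustration (added example): group a has 10 rows, and 10 * 0.5 = 5 < 8,
# so the function falls back to sampling the minimum of 8 rows
conditional_sample(df[df$group == "a", ], sample_min, sample_prop) # returns 8 rows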
# Apply the function to our list -- ideally this would be unnecessary
# within summarise
sampled <- df_list %>%
  lapply(function(x) conditional_sample(x, sample_min, sample_prop))
bind_rows(sampled) # inspect the sampled data
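# Per-group counts confirm the behavior: a yields 8 rows (the minimum, since
# 10 * 0.5 = 5 < 8) and b yields 50 rows (the proportion, 100 * 0.5)
table(bind_rows(sampled)$group)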
Answer:
A simple way is to use the max() of sample_min and sample_prop * n() as the sample size.
With slice():
library(dplyr)
sample_prop <- 0.5
sample_min <- 8
df %>%
  group_by(group) %>%
  slice(sample(n(), max(sample_min, floor(sample_prop * n())))) %>%
  ungroup()
# A tibble: 58 × 3
       x      y group
   <int>  <dbl> <chr>
 1     1  1.01  a
 2     3 -0.389 a
 3     4  0.559 a
 4     5 -0.594 a
 5     7 -0.415 a
 6     8 -1.63  a
 7     9 -2.27  a
 8    10 -0.422 a
 9    11  0.673 b
10    12 -1.23  b
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
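The 58 rows come from group a contributing the minimum of 8 (since floor(0.5 * 10) = 5 < 8) and group b contributing the proportion, floor(0.5 * 100) = 50.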
Or equivalently with filter():
df %>%
  group_by(group) %>%
  filter(row_number() %in% sample(n(), max(sample_min, floor(sample_prop * n())))) %>%
  ungroup()