If the number of rows in a group exceeds X observations, randomly sample X rows

I need to reduce the number of rows in a data set. My strategy is this: for each group that has more than X rows, randomly sample X rows from it; groups with X rows or fewer are kept as they are.

Assume the following data set:

set.seed(123)
n <- 10

df <- data.frame(id = 1:n,
                 group = sample(1:3, n, replace = TRUE))

> df
   id group
1   1     3
2   2     3
3   3     3
4   4     2
5   5     3
6   6     2
7   7     2
8   8     2
9   9     3
10 10     1

Here X is 2. Counting the number of rows in each group gives:

> table(df$group)

1 2 3 
1 4 5

This means that in the end result I want 1 observation in group 1 and 2 observations each in groups 2 and 3. The rows kept in groups 2 and 3 should be randomly selected. This would reduce the data from 10 rows to 5.

How do I do this in an efficient way?

Thanks!

User response:

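Here is one way: group by the group column, then put a condition inside slice(). If the number of rows in the group (n()) is at least X, sample X of the row positions (row_number()) without replacement; otherwise return all of the group's row positions. In the code they are passed through sample(), which only shuffles their order; a plain row_number() would keep the original order.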

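library(dplyr)

X <- 2  # maximum number of rows to keep per group

df %>%
  group_by(group) %>%
  # groups with at least X rows: draw X row positions without replacement;
  # smaller groups: keep every row (sample() only shuffles the order)
  slice(if (n() >= X) sample(row_number(), X, replace = FALSE) else
    sample(row_number())) %>%
  ungroup()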

Output (which rows are drawn depends on the state of the random number generator):

# A tibble: 5 × 2
     id group
  <int> <int>
1    10     1
2     8     2
3     4     2
4     1     3
5     9     3
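
A shorter variant may be possible with slice_sample(), assuming dplyr 1.0.0 or later is installed. According to its documentation, when n is larger than the group size the result is silently truncated to the group size, so no explicit if/else should be needed. The sketch below rebuilds df from the values printed in the question (rather than relying on the seed) and names the result reduced, which is a name chosen here for illustration:

```r
library(dplyr)

# Data copied from the printed data frame in the question
df <- data.frame(
  id    = 1:10,
  group = c(3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 1L)
)

X <- 2  # maximum number of rows to keep per group

reduced <- df %>%
  group_by(group) %>%
  # draw up to X rows per group without replacement;
  # groups with fewer than X rows are returned whole
  slice_sample(n = X) %>%
  ungroup()

# Check the group sizes after sampling: each should be min(original size, X)
table(reduced$group)
```

With this data the table should show 1 row for group 1 and 2 rows each for groups 2 and 3, so 5 rows in total. Which ids are kept in groups 2 and 3 will differ between runs unless a seed is set first.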