I need to reduce the number of rows in a data set. My strategy is to randomly sample X rows from each group whenever the number of rows in that group exceeds X; groups with X or fewer rows are kept whole.
Assume the following data set:
set.seed(123)
n <- 10
df <- data.frame(id = 1:n,
                 group = sample(1:3, n, replace = TRUE))
> df
id group
1 1 3
2 2 3
3 3 3
4 4 2
5 5 3
6 6 2
7 7 2
8 8 2
9 9 3
10 10 1
where X == 2. Let's count the number of rows in each group.
> table(df$group)
1 2 3
1 4 5
This means that in the end result I want 1 observation in group 1 and 2 observations each in groups 2 and 3. The rows that are kept in groups 2 and 3 should be randomly selected. This would reduce the data's size from 10 rows to 5.
How do I do this in an efficient way?
Thanks!
CodePudding user response:
Here is one way: group by the group column and put a condition inside slice that checks whether the number of rows in the group (n()) is at least X. If it is, sample X of the row indices (row_number()); otherwise return all of the group's rows via sample(row_number()), which keeps every row but in shuffled order.
library(dplyr)
X <- 2
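# Within each group: sample X row indices if the group has at least X rows,
# otherwise keep every row of the group (in shuffled order).
df %>%
  group_by(group) %>%
  slice(if (n() >= X) sample(row_number(), X, replace = FALSE) else
    sample(row_number())) %>%
  ungroup()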
-output
# A tibble: 5 × 2
id group
<int> <int>
1 10 1
2 8 2
3 4 2
4 1 3
5 9 3
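For comparison, the same idea can be sketched in base R without dplyr. This is only an illustrative alternative, not part of the answer above: the names keep and reduced are made up for the example, and it assumes the df and X defined in the question.

```r
set.seed(123)
n <- 10
df <- data.frame(id = 1:n,
                 group = sample(1:3, n, replace = TRUE))
X <- 2

# Split the row indices by group, then keep at most X indices per group.
# sample.int() draws positions within the group, which avoids the
# sample(x) pitfall where a length-one numeric x is treated as 1:x.
keep <- unlist(lapply(split(seq_len(nrow(df)), df$group), function(idx) {
  if (length(idx) > X) idx[sample.int(length(idx), X)] else idx
}), use.names = FALSE)

# Subset the original data, preserving the original row order.
reduced <- df[sort(keep), ]
table(reduced$group)
```

With the group counts shown in the question (1, 4 and 5 rows), this keeps 1 + 2 + 2 = 5 rows. If dplyr 1.0.0 or later is available, df %>% group_by(group) %>% slice_sample(n = X) should behave the same way, since its documentation states that n is silently truncated to the group size when sampling without replacement.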