I have some data that I want to split into 4 equal parts based on the group.
My dataframe looks like this:
X | Group |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 1 |
6 | 1 |
7 | 2 |
8 | 2 |
9 | 3 |
10 | 3 |
11 | 3 |
12 | 3 |
13 | 3 |
14 | 3 |
15 | 3 |
16 | 3 |
Now I thought about adding a third column to mark which rows belong to which split, like this:
X | Group | Split |
---|---|---|
1 | 1 | 1 |
2 | 1 | 3 |
3 | 1 | 2 |
4 | 1 | 4 |
5 | 1 | 4 |
6 | 1 | 2 |
7 | 2 | 3 |
8 | 2 | 1 |
9 | 3 | 1 |
10 | 3 | 2 |
11 | 3 | 3 |
12 | 3 | 4 |
13 | 3 | 1 |
14 | 3 | 2 |
15 | 3 | 3 |
16 | 3 | 4 |
I don't need to actually split the dataset, because the data are videos and I just have to mark who (which person) has to watch them.
I know how to generate random numbers, but I need them to be stratified by group.
I know how to get a stratified sample, but that's not what I want, because I want to distribute ALL data (videos in this case), just in a stratified fashion.
Can you help me achieve this?
Thank you!
edit: I changed the example to unequally sized groups.
CodePudding user response:
You can easily do this kind of stratified operation using dplyr::group_by():
library(tidyverse)
df <- data.frame(
  X = 1:12,
  Group = c(rep(1, 4), rep(2, 4), rep(3, 4))
)
df %>%
  group_by(Group) %>%
  # Shuffle the positions 1..n within each group, then map them onto splits 1 to 4
  mutate(Split = sample(seq_along(X), size = n(), replace = FALSE) %% 4 + 1) %>%
  ungroup()
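
Assigning splits within each group separately keeps every group balanced, but with unequally sized groups (as in the edited question) the leftover rows of groups whose size is not a multiple of 4 can pile up in the same splits, so the four parts may end up with different totals. One possible variation, sketched below, is to shuffle the rows within each group and then cycle through the split numbers across the whole table. The names `n_splits`, `rand_order` and `result` are made up for this illustration, and the fixed seed is only there to make the sketch reproducible.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only for reproducibility of the sketch

# Unequally sized groups as in the question: 6, 2 and 8 videos
df <- data.frame(
  X = 1:16,
  Group = rep(c(1, 2, 3), times = c(6, 2, 8))
)

n_splits <- 4  # hypothetical name: number of people watching videos

result <- df %>%
  mutate(rand_order = sample(n())) %>%      # random permutation used as sort key
  arrange(Group, rand_order) %>%            # rows are now shuffled within each group
  mutate(Split = (seq_len(n()) - 1) %% n_splits + 1) %>%  # cycle 1..4 over the whole table
  arrange(X) %>%                            # restore the original row order
  select(-rand_order)

# Rows = groups, columns = splits
table(result$Group, result$Split)
```

Because the split numbers cycle over rows that are sorted by group, each group occupies a contiguous stretch of the cycle, so its counts per split differ by at most one, and the overall split sizes differ by at most one as well. Which videos land in which split is still random within each group.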