I have a dataframe which I would like to split into one 75% and one 25% parts of the original. I thought a good first step would be to create the 25% dataset from the original dataset, by randomly sampling a quarter of the data.
However sampling shouldn't be entirely random, I want to preserve groups of a certain variable.
So with the example below, I want to randomly sample 1/4 of the data frame, but data needs to remain grouped via the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried this using the slice_sample code from dplyr, but this does the exact opposite of what I want (it splits all teams)
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))
CodePudding user response:
With dplyr
, if you group_by(team)
and then sample, that's sampling within each team--the opposite of what you want. Here's a direct approach:
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)
CodePudding user response:
library(caTools)
split <- sample.split(dataset$team, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)