Sample part of a dataset while keeping subgroups intact

Time:09-09

I have a data frame that I would like to split into two parts, one with 75% of the original rows and one with 25%. I thought a good first step would be to create the 25% dataset by randomly sampling a quarter of the original data.

However, the sampling shouldn't be entirely random: I want to keep the groups defined by a certain variable intact.

So with the example below, I want to randomly sample 1/4 of the data frame, but data needs to remain grouped via the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.

Data example (dput below)

   team points assists
1     1     99      33
2     1     90      28
3     1     86      31
4     1     88      39
5     2     95      34
6     2     92      30
7     2     91      32
8     2     79      35
9     3     85      36
10    3     90      29
11    3     91      24
12    3     97      26
13    4     96      28
14    4     94      18
15    4     95      19
16    4     98      25
17    5     78      36
18    5     80      34
19    5     85      39
20    5     89      33
21    6     94      34
22    6     85      39
23    6     99      28
24    6     79      31
25    7     78      35
26    7     99      29
27    7     98      36
28    7     75      39
29    8     97      33
30    8     68      26
31    8     86      38
32    8     76      31

I've tried this using slice_sample from dplyr, but this does the exact opposite of what I want (it splits all teams):

testdata <- df %>% group_by(team) %>% slice_sample(n = 2)

My code results in

    team points assists
   <dbl>  <dbl>   <dbl>
 1     1     90      28
 2     1     99      33
 3     2     95      34
 4     2     92      30
 5     3     91      24
 6     3     85      36
 7     4     95      19
 8     4     98      25
 9     5     80      34
10     5     78      36
11     6     85      39
12     6     94      34
13     7     78      35
14     7     98      36
15     8     76      31
16     8     86      38

dput() output of the data frame:

structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99, 
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78, 
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76), 
    assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26, 
    28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36, 
    39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA, 
-32L))

CodePudding user response:

With dplyr, if you group_by(team) and then sample, that's sampling within each team--the opposite of what you want. Here's a direct approach:

library(dplyr)

# Sample 2 of the 8 teams, then assign every row of those teams to the test set
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)
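
The same idea can be wrapped in a small function when the number of teams to hold out should follow from a proportion rather than being hard-coded. The sketch below is one possible way to do that in base R, so it runs without extra packages; split_by_group and its arguments are hypothetical names introduced for illustration, and the data frame is rebuilt from the team and points columns of the question.

```r
# Rebuild the question's data (team and points columns only, to keep this short).
dataset <- data.frame(
  team   = rep(1:8, each = 4),
  points = c(99, 90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98,
             78, 80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76)
)

# Hypothetical helper: hold out roughly `prop` of the groups, keeping each group whole.
split_by_group <- function(data, group_col, prop = 0.25, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, makes the split reproducible
  groups <- unique(data[[group_col]])
  n_test <- max(1, round(length(groups) * prop))
  # Sample positions rather than values: sample(x, n) treats a single number x as 1:x.
  test_groups <- groups[sample.int(length(groups), n_test)]
  in_test <- data[[group_col]] %in% test_groups
  list(train = data[!in_test, , drop = FALSE],
       test  = data[in_test, , drop = FALSE])
}

parts <- split_by_group(dataset, "team", prop = 0.25, seed = 42)

# No team should appear on both sides of the split.
stopifnot(length(intersect(parts$train$team, parts$test$team)) == 0)
```

With 8 teams and prop = 0.25, two whole teams (8 rows) go to parts$test and the remaining six teams (24 rows) go to parts$train. Which two teams are chosen depends on the seed.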

CodePudding user response:

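library(caTools)
# sample.split() on dataset$team keeps about 75% of the rows of every team, which splits the groups.
# Applying it to the unique team values instead assigns whole teams to one side.
teams <- unique(dataset$team)
in_train <- sample.split(teams, SplitRatio = 0.75)
training_set <- subset(dataset, team %in% teams[in_train])
test_set <- subset(dataset, team %in% teams[!in_train])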
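
Whichever approach is used, it is worth checking that no team ends up on both sides of the split. A minimal base R sketch of such a check follows; groups_intact is a hypothetical helper name, and toy is a small stand-in built from the first two rows of teams 1 to 4 in the question.

```r
# Hypothetical helper: TRUE when no group value occurs in both data frames.
groups_intact <- function(train, test, group_col) {
  length(intersect(train[[group_col]], test[[group_col]])) == 0
}

# Toy stand-in: two rows each for teams 1 to 4.
toy <- data.frame(team   = rep(1:4, each = 2),
                  points = c(99, 90, 95, 92, 85, 90, 96, 94))

# Whole-team split: team 4 is held out, teams 1 to 3 stay in training.
whole_team <- groups_intact(toy[toy$team != 4, ], toy[toy$team == 4, ], "team")

# Row-level split: rows 1, 3, 5, 7 take one row from each team, so every team is split.
row_level <- groups_intact(toy[-c(1, 3, 5, 7), ], toy[c(1, 3, 5, 7), ], "team")

print(c(whole_team = whole_team, row_level = row_level))
```

Here whole_team is TRUE and row_level is FALSE, so the check separates a split that respects teams from one that samples rows within each team.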