I have a dataset like the table below. I want to split it into train and test sets, stratified on the labels. At the same time, I don't want the same player to appear in both sets.
For example, when I split it with train:test = 1:1:
player | utterances | label |
---|---|---|
Bob | ... | 1 |
John | ... | 1 |
Mary | ... | 0 |
Kethy | ... | 1 |
Jack | ... | 1 |
John | ... | 0 |
John | ... | 1 |
Mary | ... | 1 |
→
train(label 0 : label 1 = 1 : 3)
player | utterances | label |
---|---|---|
Bob | ... | 1 |
John | ... | 1 |
John | ... | 0 |
John | ... | 1 |
→
test(label 0 : label 1 = 1 : 3)
player | utterances | label |
---|---|---|
Mary | ... | 0 |
Mary | ... | 1 |
Kethy | ... | 1 |
Jack | ... | 1 |
CodePudding user response:
import pandas as pd
from sklearn.model_selection import train_test_split
# Split the DataFrame into one sub-frame per player, so a player can never end up in both sets.
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]
# Split at the group level, then re-draw until both sides hold the same number of rows.
# Note: this loop only terminates if such an equal split exists, and it does not stratify on labels.
train, test = train_test_split(groups, test_size=0.5)
while len(pd.concat(train)) != len(pd.concat(test)):
    train, test = train_test_split(groups, test_size=0.5)
train = pd.concat(train)
test = pd.concat(test)
CodePudding user response:
Inspired by tako0707's answer, I split my data into train, valid, and test sets as shown below.
Fortunately, the labels in the train, valid, and test sets turned out to be almost stratified.
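import pandas as pd
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]
n_rows = len(df)  # total row count; assumes the original `labels` held one label per row
i = 0
# Fill train with whole player groups until it holds about 80% of the rows.
train, train_size = [groups[i]], len(groups[i])
while train_size < n_rows * 0.8:
    i += 1
    train_size += len(groups[i])
    train.append(groups[i])
# Start test from the next unused group so that no player is shared with train.
i += 1
test, test_size = [groups[i]], len(groups[i])
while test_size < n_rows * 0.1:
    i += 1
    test_size += len(groups[i])
    test.append(groups[i])
# Same for valid: start from the next unused group.
i += 1
valid, valid_size = [groups[i]], len(groups[i])
while valid_size < n_rows * 0.1:
    i += 1
    valid_size += len(groups[i])
    valid.append(groups[i])
# Any groups left over go to train.
train.extend(groups[i + 1:])
train, valid, test = pd.concat(train), pd.concat(valid), pd.concat(test)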
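Since this approach only gets stratified labels by luck, it may be worth checking the result. The sketch below uses only the standard library; `summarize_split` and `shared_players` are hypothetical helper names introduced for illustration, and the toy rows are taken from the example split in the question.

```python
from collections import Counter

def summarize_split(rows):
    """Return (set of players, label proportions) for an iterable of (player, label) pairs."""
    rows = list(rows)
    players = {player for player, _ in rows}
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    proportions = {label: count / total for label, count in counts.items()}
    return players, proportions

def shared_players(*splits):
    """Return the players that occur in more than one split."""
    seen, shared = set(), set()
    for rows in splits:
        players, _ = summarize_split(rows)
        shared |= seen & players
        seen |= players
    return shared

# Toy rows taken from the example train/test split in the question.
train_rows = [("Bob", 1), ("John", 1), ("John", 0), ("John", 1)]
test_rows = [("Mary", 0), ("Mary", 1), ("Kethy", 1), ("Jack", 1)]

print(summarize_split(train_rows)[1])       # label proportions in train
print(summarize_split(test_rows)[1])        # label proportions in test
print(shared_players(train_rows, test_rows))  # players present in both splits
```

With the DataFrames above, the rows could be built with something like `list(zip(train['player'], train['label']))` before being passed to these helpers.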