How to split data into train and test while stratifying on labels and preventing the same entity from appearing in both

I have a dataset like the table below. I want to split it into train and test sets, stratified on labels. At the same time, I don't want the same player to appear in both sets.

For example, with a train:test split of 1:1:

player	utterances	label
Bob	...	1
John	...	1
Mary	...	0
Kethy	...	1
Jack	...	1
John	...	0
John	...	1
Mary	...	1

→

train(label 0 : label 1 = 1 : 3)

player	utterances	label
Bob	...	1
John	...	1
John	...	0
John	...	1

→

test(label 0 : label 1 = 1 : 3)

player	utterances	label
Mary	...	0
Mary	...	1
Kethy	...	1
Jack	...	1

CodePudding user response：

import pandas as pd
from sklearn.model_selection import train_test_split

# One DataFrame per player, so a player is always assigned to a single side.
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]

# Re-split at random until both sides hold the same number of rows.
# Note: this loops forever if no such split exists, and it does not stratify on label.
train, test = train_test_split(groups, test_size=0.5)
while len(pd.concat(train)) != len(pd.concat(test)):
    train, test = train_test_split(groups, test_size=0.5)

train = pd.concat(train)
test = pd.concat(test)

CodePudding user response：

Inspired by tako0707's answer, I split my data into train, valid and test sets as shown below.

Fortunately, the labels in train, valid and test turned out to be almost stratified.

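import pandas as pd

# One DataFrame per player, so each player ends up in exactly one split.
# This assumes there are enough groups to fill all three splits.
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]
n_rows = len(df)  # the original used len(labels), with labels presumably df['label']

# Fill train with whole groups until it holds about 80% of the rows.
i = 0
train, train_size = [groups[i]], len(groups[i])
while train_size < n_rows * 0.8:
    i += 1
    train_size += len(groups[i])
    train.append(groups[i])

# Move to the next unused group so test shares no player with train.
i += 1
test, test_size = [groups[i]], len(groups[i])
while test_size < n_rows * 0.1:
    i += 1
    test_size += len(groups[i])
    test.append(groups[i])

# Same for valid; stop early if the groups run out.
i += 1
valid, valid_size = [groups[i]], len(groups[i])
while valid_size < n_rows * 0.1 and i + 1 < len(groups):
    i += 1
    valid_size += len(groups[i])
    valid.append(groups[i])

# Any leftover groups go to train.
train.extend(groups[i + 1:])

train, valid, test = pd.concat(train), pd.concat(valid), pd.concat(test)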
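
Since this approach only ends up stratified by luck, it may help to check the result. Here is a minimal sketch of such a check. The helper name describe_split and its arguments are my own, not from any library. It reports which players are shared between parts and the share of each label in every part. The toy frames are the expected train and test sets from the question.

```python
import pandas as pd

def describe_split(splits, group_col="player", label_col="label"):
    """Return (shared_groups, label_share) for a dict of name -> DataFrame."""
    # Compare every pair of parts and collect groups that occur in both.
    names = list(splits)
    shared = set()
    for pos, first in enumerate(names):
        for second in names[pos + 1:]:
            shared |= set(splits[first][group_col]) & set(splits[second][group_col])
    # One row per part, one column per label; similar rows mean a near-stratified split.
    label_share = pd.DataFrame({
        name: part[label_col].value_counts(normalize=True)
        for name, part in splits.items()
    }).T.fillna(0.0)
    return shared, label_share

# Toy parts taken from the example in the question.
train = pd.DataFrame({"player": ["Bob", "John", "John", "John"], "label": [1, 1, 0, 1]})
test = pd.DataFrame({"player": ["Mary", "Mary", "Kethy", "Jack"], "label": [0, 1, 1, 1]})

shared, label_share = describe_split({"train": train, "test": test})
print(shared)
print(label_share)
```

With the three-way split above, the same call would take {"train": train, "valid": valid, "test": test}.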