How to split data into train and test while stratifying on labels and preventing the same entity fro

Time:12-26

I have a dataset like the table below. I want to split it into train and test sets, stratified on the labels. At the same time, I don't want the same player to appear in both sets.

For example, splitting the following data with train:test = 1:1 should give a result like the one shown after it.

player utterances label
Bob ... 1
John ... 1
Mary ... 0
Kethy ... 1
Jack ... 1
John ... 0
John ... 1
Mary ... 1

train(label 0 : label 1 = 1 : 3)

player utterances label
Bob ... 1
John ... 1
John ... 0
John ... 1

test(label 0 : label 1 = 1 : 3)

player utterances label
Mary ... 0
Mary ... 1
Kethy ... 1
Jack ... 1
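
To make the target concrete, here is a minimal sketch that rebuilds the example table as a DataFrame and reproduces the desired split by hand. The utterances are elided in the question, so a placeholder string stands in for them, and the choice of `train_players` is simply copied from the example, not computed.

```python
import pandas as pd

# The example table from the question; "..." is a placeholder for the
# elided utterances.
df = pd.DataFrame({
    "player": ["Bob", "John", "Mary", "Kethy", "Jack", "John", "John", "Mary"],
    "utterances": ["..."] * 8,
    "label": [1, 1, 0, 1, 1, 0, 1, 1],
})

# Hand-picked assignment taken from the example: whole players go to one side.
train_players = {"Bob", "John"}
train = df[df["player"].isin(train_players)]
test = df[~df["player"].isin(train_players)]

# Label counts per split; both sides should show label 0 : label 1 = 1 : 3.
train_counts = train["label"].value_counts().to_dict()
test_counts = test["label"].value_counts().to_dict()
print(train_counts, test_counts)
```

Both splits contain four rows, no player is shared, and the label ratio matches on each side, which is the property the answers below try to achieve automatically.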

CodePudding user response:

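import pandas as pd
from sklearn.model_selection import train_test_split

# df is assumed to be the DataFrame from the question
# (columns: player, utterances, label).
# Split it into one sub-DataFrame per player so that a player
# can never end up in both sets.
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]

# Randomly assign whole players to train/test and retry until both sets
# contain the same number of rows. Note: this loops forever if no such
# assignment exists, and it does not stratify on the label.
train, test = train_test_split(groups, test_size=0.5)
while len(pd.concat(train)) != len(pd.concat(test)):
    train, test = train_test_split(groups, test_size=0.5)

train = pd.concat(train)
test = pd.concat(test)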

CodePudding user response:

Inspired by tako0707's answer, I split my data into train, valid and test as shown below.

Fortunately, the labels in train, valid and test turned out to be almost stratified.

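import pandas as pd

# One sub-DataFrame per player, so each player lands in exactly one split.
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]
labels = df['label']  # assumed: the label column, one entry per row

# Fill train with whole players until it holds about 80% of the rows.
i = 0
train, train_size = [groups[i]], len(groups[i])
while train_size < len(labels) * 0.8:
    i += 1
    train_size += len(groups[i])
    train.append(groups[i])

# Advance to the next unused player (otherwise the last train player
# would be reused), then fill test up to about 10% of the rows.
i += 1
test, test_size = [groups[i]], len(groups[i])
while test_size < len(labels) * 0.1:
    i += 1
    test_size += len(groups[i])
    test.append(groups[i])

# Same again for valid.
i += 1
valid, valid_size = [groups[i]], len(groups[i])
while valid_size < len(labels) * 0.1:
    i += 1
    valid_size += len(groups[i])
    valid.append(groups[i])

# Any players left over go to train.
train.extend(groups[i + 1:])

train, valid, test = pd.concat(train), pd.concat(valid), pd.concat(test)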
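
Since this approach only ends up stratified by luck, it may be worth checking the result. The helper below is a hypothetical sketch (the name `check_split` and its parameters are not from the thread): it raises if any player appears in more than one split and returns the label proportions of each split so they can be compared. It is demonstrated on the question's example data with placeholder utterances.

```python
import pandas as pd

def check_split(splits, group_col="player", label_col="label"):
    """Return label proportions per split; raise if a group is shared."""
    seen = {}  # group value -> name of the split it was first found in
    for name, part in splits.items():
        for g in part[group_col].unique():
            if g in seen:
                raise ValueError(f"{g!r} appears in both {seen[g]} and {name}")
            seen[g] = name
    return {
        name: part[label_col].value_counts(normalize=True).to_dict()
        for name, part in splits.items()
    }

# Example data from the question; "..." stands in for the elided utterances.
df = pd.DataFrame({
    "player": ["Bob", "John", "Mary", "Kethy", "Jack", "John", "John", "Mary"],
    "utterances": ["..."] * 8,
    "label": [1, 1, 0, 1, 1, 0, 1, 1],
})
in_train = df["player"].isin({"Bob", "John"})
props = check_split({"train": df[in_train], "test": df[~in_train]})
print(props)
```

If the proportions differ too much between splits, one option is to reshuffle the order of the player groups and split again.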