Python: Split a dataset into training and testing sets


I need to split a dataset into training and testing sets without using sklearn.model_selection.train_test_split.

I want the approach to be as follows:

  1. Read the dataset from Excel with 100 rows (DONE):

    import pandas as pd
    data = pd.read_excel('file.xlsx')

  2. From the 100 rows, select a random 75% as training data (DONE):

    random_training = data.sample(75)

  3. Use a for loop to check which indexes exist in data but do not exist in random_training; any index that is not in random_training should go into a random_testing list. This is the part I am finding hard to implement. Any ideas?

CodePudding user response:

You can use DataLoader, SubsetRandomSampler, and random.sample:

from torch.utils.data import DataLoader, SubsetRandomSampler
import random

# batch_size and num_workers are assumed to be defined elsewhere.
# Sample 75% of the indices (0 .. len(dataset) - 1) for training.
indices = random.sample(range(len(dataset)), int(len(dataset) * 0.75))
# The remaining 25% of the indices form the testing/validation split.
missing_indices = [index
                   for index in range(len(dataset))
                   if index not in indices]
dl_train = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(indices), num_workers=num_workers)
dl_valid = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(missing_indices), num_workers=num_workers)
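Note that DataLoader expects a PyTorch Dataset rather than a pandas DataFrame, so the question's data would need to be wrapped first. A minimal sketch of one way to do that, assuming all columns are numeric and torch is installed:

import pandas as pd
import torch
from torch.utils.data import TensorDataset

# Read the 100-row sheet from the question and wrap it as a Dataset
data = pd.read_excel('file.xlsx')
dataset = TensorDataset(torch.tensor(data.values, dtype=torch.float32))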

CodePudding user response:

# Indexes that were sampled for training
tr = list(random_training.index)
# Everything in data whose index is not in the training sample
testing = data.loc[data.index.drop(tr)]
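As a quick sanity check (assuming data, random_training, and testing are as above), the two parts should cover all 100 rows with no overlap:

assert len(random_training) + len(testing) == len(data)
assert random_training.index.intersection(testing.index).empty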

CodePudding user response:

All good approaches, but the one suggested by @hyper-cookie seems the simplest and should work fine. I will use data.sample(frac=1) to shuffle the dataset first and then take the first 75 rows for training and the last 25 for testing, roughly as sketched below.
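A minimal sketch of that shuffle-then-slice approach, assuming the 100-row data DataFrame from the question (random_state is only an optional assumption, used here for reproducibility):

# Shuffle all rows, then slice off the first 75 for training and the rest for testing
shuffled = data.sample(frac=1, random_state=42)
training = shuffled.iloc[:75]
testing = shuffled.iloc[75:]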
