I need to split a dataset into training and testing sets without using sklearn.model_selection.train_test_split.
I want the approach to be as follows:
Read the dataset from Excel with 100 rows (DONE):
data = pd.read_excel('file.xlsx')
From the 100 rows, select 75% of the rows at random as training data (DONE):
random_training = data.sample(75)
Use a for loop to check which indexes exist in data but do not exist in random_training. If an index is not in random_training, put it in a random_testing list. This is the part I am finding hard to execute. Any ideas?
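For context, a minimal sketch of the loop being described, using a hypothetical 100-row frame in place of the Excel file:

```python
import pandas as pd

# stand-in for the 100-row Excel file
data = pd.DataFrame({'value': range(100)})

random_training = data.sample(75)

# collect every index of data that is absent from random_training
random_testing_idx = [i for i in data.index if i not in random_training.index]
random_testing = data.loc[random_testing_idx]

print(len(random_training), len(random_testing))  # 75 25
```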
CodePudding user response:
You can use DataLoader, SubsetRandomSampler, and random.sample:
from torch.utils.data import DataLoader, SubsetRandomSampler
import random

# batch_size and num_workers are assumed to be defined elsewhere
indices = random.sample(range(len(dataset)), int(len(dataset) * 0.75))
missing_indices = [index for index in range(len(dataset)) if index not in indices]

# 75% of the indices go to training, the remaining 25% to validation
dl_train = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(indices), num_workers=num_workers)
dl_valid = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(missing_indices), num_workers=num_workers)
CodePudding user response:
tr = list(random_training.index)
testing = data.loc[data.index.drop(tr)]
CodePudding user response:
All good approaches, but the one suggested by @hyper-cookie seems the simplest and should work fine. I will use data.sample(frac=1) to first shuffle the dataset, then take the first 75 rows for training and the last 25 for testing.
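A minimal sketch of that shuffle-then-slice approach, again with a hypothetical 100-row frame standing in for the Excel file:

```python
import pandas as pd

# stand-in for the 100-row Excel file
data = pd.DataFrame({'value': range(100)})

# sample(frac=1) returns all rows in random order
shuffled = data.sample(frac=1, random_state=42)  # random_state only for reproducibility

training = shuffled.iloc[:75]
testing = shuffled.iloc[75:]

print(len(training), len(testing))  # 75 25
```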