Why does my kernel die every time I run train-test split on this particular dataset?


I've used train-test split before and haven't had any issues. This time I have a rather large (1 GB) image dataset for my CNN, and my kernel dies every time I try to split it. I've read that it sometimes helps to pass shuffle=False; I tried that with no luck. I've included my code below. Any help would be appreciated!!

import numpy as np
import pandas as pd
import os
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from PIL import Image
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import accuracy_score
np.random.seed(42)
data_dir='birds/'
train_path=data_dir + '/train'
test_path=data_dir + '/test'
img_size=(100,100)
channels=3
num_categories=len(os.listdir(train_path))
#get list of each category to zip
names_of_species=[]

for i in os.listdir(train_path):
    names_of_species.append(i)

#make list of numbers from 0 to 299:
num_list=[]
for i in range(300):
    num_list.append(i)
nums_and_names=dict(zip(num_list, names_of_species))
folders=os.listdir(train_path)
import random
from matplotlib.image import imread
df=pd.read_csv(data_dir + '/Bird_Species.csv')

img_data=[]
img_labels=[]

for i in nums_and_names:
    path=data_dir + 'train/' + str(names_of_species[i])
    images=os.listdir(path)
    
    for img in images:
        try:
            image=cv2.imread(path + '/' + img)
            image_fromarray=Image.fromarray(image, 'RGB')
            resize_image=image_fromarray.resize((img_size))
            img_data.append(np.array(resize_image))
            img_labels.append(num_list[i])
        except:
            print("Error in " img)
img_data=np.array(img_data)
img_labels=np.array(img_labels)
img_labels
array([210,  41, 148, ...,  15, 115, 292])
#SHUFFLE TRAINING DATA

shuffle_indices=np.arange(img_data.shape[0])
np.random.shuffle(shuffle_indices)
img_data=img_data[shuffle_indices]
img_labels=img_labels[shuffle_indices]
#Split the data

X_train, X_test, y_train, y_test=train_test_split(img_data,img_labels, test_size=0.2,random_state=42, shuffle=False)

#Rescale pixel values
X_train=X_train/255
X_test=X_test/255

CodePudding user response:

This means that you are probably running out of RAM or GPU memory.

To check on Windows, open Task Manager (Ctrl+Shift+Esc), go to the Performance tab, run the code, and watch the RAM usage and GPU memory usage to determine whether either of them is the cause.

Note: to monitor GPU memory you should watch "Dedicated GPU memory", which is shown at the bottom left when you click on GPU.
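If you'd rather check from inside the notebook than from Task Manager, here is a minimal sketch using the psutil package (an extra dependency, not part of the question's code) to print system RAM usage around the split; img_data and img_labels are the arrays built in the question:

import psutil
from sklearn.model_selection import train_test_split

def print_ram_usage(label):
    # psutil.virtual_memory() reports system-wide RAM usage in bytes
    mem = psutil.virtual_memory()
    print(f"{label}: {mem.used / 1e9:.2f} GB used of {mem.total / 1e9:.2f} GB ({mem.percent}%)")

print_ram_usage("before split")
X_train, X_test, y_train, y_test = train_test_split(
    img_data, img_labels, test_size=0.2, random_state=42, shuffle=False)
print_ram_usage("after split")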

CodePudding user response:

Adding to MK's answer: if the cause of your kernel crash is indeed a RAM/GPU limit, you could try to load your data in batches. Instead of splitting the entire dataset at once, try dividing it maybe a quarter at a time, as in the sketch below.
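
One way to read that, as a rough sketch (np.array_split and the chunk count of 4 are my choices, and the output filenames are hypothetical): split the question's img_data/img_labels into chunks, run train_test_split on one chunk at a time, and write each result to disk before moving on, so only a fraction of the data is duplicated in memory at once.

import numpy as np
from sklearn.model_selection import train_test_split

n_chunks = 4  # process roughly a quarter of the data at a time
chunks = zip(np.array_split(img_data, n_chunks), np.array_split(img_labels, n_chunks))
for k, (data_chunk, label_chunk) in enumerate(chunks):
    X_tr, X_te, y_tr, y_te = train_test_split(
        data_chunk, label_chunk, test_size=0.2, random_state=42)
    # save each chunk's split to disk (hypothetical filenames) and free it,
    # so only about a quarter of the data is duplicated in memory at once
    np.savez(f"split_chunk_{k}.npz", X_train=X_tr, X_test=X_te, y_train=y_tr, y_test=y_te)
    del X_tr, X_te, y_tr, y_te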

CodePudding user response:

Notice that after splitting the data you are basically keeping two copies of the same data (the original img_data and img_labels, plus the split arrays). If you are running out of memory, the best approach is to manage it via an index array from which you implicitly pull batches as you need them.

Create a shuffled array of indices,

shuffle_indices = np.random.permutation(img_data.shape[0])

which does the same as your two lines in one step.

Split the indices corresponding to points in the train and test sets:

train_indices, test_indices = train_test_split(shuffle_indices, test_size=0.2, random_state=42, shuffle=False)

Then, iterate on batches,

n_train = len(train_indices)
for epoch in range(n_epochs):
    # further shuffle the training data for each iteration, if desired
    epoch_shuffle = np.random.permutation(n_train)

    for i in range(0, n_train, batch_size):
        # get data batches by indexing into the original arrays
        x_batch = img_data[train_indices[epoch_shuffle[i : i + batch_size]]]
        y_batch = img_labels[train_indices[epoch_shuffle[i : i + batch_size]]]

        # train model
        ... 
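
On the same memory theme, if you rescale by 255 as in the question, doing it per batch inside the loop (a small addition to the sketch above) avoids materializing a float copy of the whole dataset:

        # rescale only the current batch, keeping img_data in its original uint8 form
        x_batch = x_batch.astype("float32") / 255.0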