I’m working on a neural network and my dataset has 42,000 images, all of which I have to load. I’m using Google Colab for that, but every time I load the dataset the RAM is insufficient.
I’m putting everything into a NumPy array, because I tried the ImageDataGenerator approach and it didn’t work. I’m using the following code to load the data:
import glob
import numpy as np
import tensorflow as tf

class_paths = glob.glob(r"/content/drive/MyDrive/DATASET/class/*.*")
data = []
labels = []
for i in class_paths:
    image = tf.keras.preprocessing.image.load_img(i, color_mode='rgb',
                                                  target_size=(336, 336))
    image = np.array(image)
    data.append(image)
    labels.append(0)
data = np.array(data)
labels = np.array(labels)
CodePudding user response:
As ImageDataGenerator is deprecated, you can use a custom Keras Sequence class to load images only when they are needed.
The strategy here is to build a Pandas DataFrame with the path and class of every image, then turn the class names into numeric labels with pd.factorize. Once you have X (paths) and y (labels), you can use train_test_split to extract 3 subsets: train, validation and test. The last step is to wrap these collections in datasets compatible with TensorFlow.
Each time TensorFlow processes a batch, the Sequence loads just that batch of images into memory, and so on.
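As a rough sanity check on the numbers from the question: 42,000 RGB images at 336×336 take about 42,000 × 336 × 336 × 3 ≈ 14 GB just as uint8, and roughly four times that once converted to float32, which is more than the RAM a standard Colab runtime provides. A single batch of 32 float32 images, by contrast, is only about 32 × 336 × 336 × 3 × 4 bytes ≈ 43 MB, which is why loading per batch fits comfortably.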
Step 0: Imports and constants
import tensorflow as tf
import pandas as pd
import numpy as np
import pathlib
from sklearn.model_selection import train_test_split
INPUT_SHAPE = (336, 336, 3)
BATCH_SIZE = 32
DATA_DIR = pathlib.Path('/content/drive/MyDrive/DATASET/')
Step 1: Load all image paths into a Pandas DataFrame:
# Find images of dataset
data = []
for file in DATA_DIR.glob('**/*.jpg'):
    d = {'class': file.parent.name,
         'path': file}
    data.append(d)
# Create dataframe and select columns
df = pd.DataFrame(data)
df['label'] = pd.factorize(df['class'])[0]
X = df['path']
y = df['label']
# Split into 3 subsets: train, validation and test
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=2023)
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_train, y_train, test_size=0.2, random_state=2023)
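If you want to check how pd.factorize encoded your classes (an optional sanity check, not required for the rest of the answer), the second value it returns is the index of unique class names, so you can map each numeric label back to its folder name:
# Optional check: factorize returns (codes, uniques)
codes, class_names = pd.factorize(df['class'])
print(dict(enumerate(class_names)))   # maps each numeric label back to its class folder name
print(df['label'].value_counts())     # number of images per numeric label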
Step 2: Create a custom data Sequence
class ImgDataSequence(tf.keras.utils.Sequence):
    """
    Check documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
    """

    def __init__(self, image_set, label_set, batch_size=32, image_size=(256, 256)):
        self.image_set = np.array(image_set)
        self.label_set = np.array(label_set)
        self.batch_size = batch_size
        self.image_size = image_size

    def __get_image(self, image):
        # Load a single image from disk and resize it
        image = tf.keras.preprocessing.image.load_img(image, color_mode='rgb', target_size=self.image_size)
        image_arr = tf.keras.preprocessing.image.img_to_array(image)
        return image_arr

    def __get_data(self, images, labels):
        # Load one batch of images and their labels
        image_batch = np.asarray([self.__get_image(img) for img in images])
        label_batch = np.asarray(labels)
        return image_batch, label_batch

    def __getitem__(self, index):
        # Return batch number `index`
        images = self.image_set[index * self.batch_size:(index + 1) * self.batch_size]
        labels = self.label_set[index * self.batch_size:(index + 1) * self.batch_size]
        images, labels = self.__get_data(images, labels)
        return images, labels

    def __len__(self):
        # Number of batches per epoch (the last, partial batch included)
        return len(self.image_set) // self.batch_size + (len(self.image_set) % self.batch_size > 0)
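One optional addition, not part of the answer above: Sequence also exposes an on_epoch_end hook that Keras calls after every epoch, so a hypothetical variant could reshuffle the sample order between epochs:
class ShuffledImgDataSequence(ImgDataSequence):
    """Hypothetical variant that reshuffles the sample order after each epoch."""

    def on_epoch_end(self):
        perm = np.random.permutation(len(self.image_set))
        self.image_set = self.image_set[perm]
        self.label_set = self.label_set[perm]
If you want shuffling, build train_ds with ShuffledImgDataSequence instead of ImgDataSequence in Step 3.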
Step 3: Create datasets
train_ds = ImgDataSequence(X_train, y_train, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
valid_ds = ImgDataSequence(X_valid, y_valid, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
test_ds = ImgDataSequence(X_test, y_test, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
Test the new datasets:
# Take the first batch of our train dataset
>>> imgs, labels = train_ds[0]
# Check the length (BATCH_SIZE)
>>> len(labels)
32
# Check the dimension of one image
>>> imgs[0].shape
(336, 336, 3)
How to use it with TensorFlow?
# train_ds & valid_ds to fit
history = model.fit(train_ds, epochs=10, validation_data=valid_ds)
# test_ds to evaluate
loss, *metrics = model.evaluate(test_ds)
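The model itself is whatever Keras model you were already training; only the input shape and the loss need to match the Sequence above. A minimal hypothetical example (integer labels from pd.factorize, hence sparse_categorical_crossentropy) could look like this:
# Hypothetical minimal model, only to show the input shape and loss that fit the Sequence above
num_classes = df['label'].nunique()

model = tf.keras.Sequential([
    tf.keras.Input(shape=INPUT_SHAPE),
    tf.keras.layers.Rescaling(1. / 255),   # scale pixel values to [0, 1]
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])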