To train an image classification model, I load the input data as NumPy arrays, and I'm dealing with thousands of images. Currently, I loop over each image and convert it into a NumPy array, as shown below.
import glob
from time import time

import cv2
import numpy as np

tem_arr_list = []
images_list = glob.glob(r'C:\Datasets\catvsdogs\cat\*.jpg')

start = time()  # start the timer once, before the loop
for idx, image_path in enumerate(images_list):
    # read each file once; cv2.imread already returns a NumPy array
    temp_arr = np.array(cv2.imread(image_path))
    # print(temp_arr.shape)
    tem_arr_list.append(temp_arr)
print("Total time taken {}".format(time() - start))
Running this takes a lot of time when the dataset is huge, so I tried a list comprehension instead:
tem_arr_list = [np.array(cv2.imread(image_path)) for image_path in images_list]
This is slightly quicker than the explicit loop, but it is still not fast enough.
I'm looking for any other way to reduce the time this operation takes. Any help or suggestions would be appreciated.
CodePudding user response:
Use a multiprocessing pool to load the data in parallel. My PC has 16 CPUs. I tried loading 100 images, and the timings are shown below.
import multiprocessing
import cv2
import glob
from time import time

def load_image(image_path):
    return cv2.imread(image_path)

if __name__ == '__main__':
    image_path_list = glob.glob('Huge_dataset/*.png')
    try:
        cpus = multiprocessing.cpu_count()
    except NotImplementedError:
        cpus = 2  # arbitrary default

    pool = multiprocessing.Pool(processes=cpus)
    start = time()
    images = pool.map(load_image, image_path_list)
    print("Total time taken using multiprocessing pool {} seconds".format(time() - start))

    images = []
    start = time()
    for image_path in image_path_list:
        images.append(load_image(image_path))
    print("Total time taken using for loop {} seconds".format(time() - start))

    start = time()
    images = [load_image(image_path) for image_path in image_path_list]
    print("Total time taken using list comprehension {} seconds".format(time() - start))
Output:
Total time taken using multiprocessing pool 0.2922379970550537 seconds
Total time taken using for loop 1.4935636520385742 seconds
Total time taken using list comprehension 1.4925990104675293 seconds
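A thread-based variant of the same pool idea may also be worth timing, since it avoids pickling each decoded image back to the parent process. This is only a sketch and was not measured in this thread. It assumes the loader spends most of its time in native code that releases the GIL, which I believe is the case for cv2.imread, but please verify on your own setup. The names load_in_parallel and fake_loader are made up for illustration, and fake_loader is just a stand-in so the snippet runs without OpenCV or any image files.

```python
from concurrent.futures import ThreadPoolExecutor


def load_in_parallel(paths, loader, workers=None):
    """Apply `loader` to every path on a thread pool, keeping input order.

    `loader` is any callable taking a path (hypothetical helper; with
    OpenCV you would pass cv2.imread here).
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as `paths`.
        return list(executor.map(loader, paths))


# Stand-in loader (an assumption for the demo): no real image I/O happens.
def fake_loader(path):
    return path.upper()


results = load_in_parallel(["a.png", "b.png", "c.png"], fake_loader, workers=2)
print(results)  # ['A.PNG', 'B.PNG', 'C.PNG']
```

With real data, the call would look like load_in_parallel(image_path_list, cv2.imread). Whether threads or processes win probably depends on image size and disk speed, so time both.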
CodePudding user response:
If you're working with numerical values, it is generally good practice to use NumPy arrays instead of lists. I suggest changing tem_arr_list into a NumPy array and stacking the images there as a single matrix instead of keeping them in a list (the list is likely what is giving you the worst performance). You can then easily access your data by indexing the new NumPy matrix.
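A minimal sketch of that stacking step, assuming every image has already been brought to the same height, width and channel count (real photos usually differ in size, so a resize step would be needed first). stack_images and fake_images are hypothetical names, and the fake arrays stand in for the output of cv2.imread.

```python
import numpy as np


def stack_images(images):
    """Stack equally shaped image arrays into one (N, H, W, C) array."""
    if not images:
        raise ValueError("no images to stack")
    first_shape = images[0].shape
    for i, img in enumerate(images):
        # np.stack needs identical shapes; fail early with a clear message.
        if img.shape != first_shape:
            raise ValueError(
                f"image {i} has shape {img.shape}, expected {first_shape}"
            )
    return np.stack(images, axis=0)


# Stand-in data (an assumption for the demo): five tiny 4x6 "BGR images".
fake_images = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(5)]
batch = stack_images(fake_images)
print(batch.shape)  # (5, 4, 6, 3)
```

After stacking, batch[i] is the i-th image and batch[:, :, :, 0] selects the first channel of all images at once.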