How to convert images into numpy array quickly?

Time:02-08

To train an image classification model, I load the input data as NumPy arrays, and I have thousands of images to process. Currently I loop over the images and convert each one to a NumPy array, as shown below.

import glob
from time import time

import cv2
import numpy as np

tem_arr_list = []
images_list = glob.glob(r'C:\Datasets\catvsdogs\cat\*.jpg')
start = time()  # start the timer once, before the loop, so the total covers every image
for image_path in images_list:
    # cv2.imread already returns a NumPy array (or None if the file cannot be read),
    # so reading the file a second time and wrapping it in np.array is unnecessary
    temp_arr = cv2.imread(image_path)
    tem_arr_list.append(temp_arr)
print("Total time taken {}".format(time() - start))

This takes a long time when the dataset is large, so I tried a list comprehension instead:

tem_arr_list = [cv2.imread(image_path) for image_path in images_list]

This is slightly quicker than the explicit loop, but still not as fast as I would like.

I'm looking for another way to reduce the time this operation takes. Any help or suggestions would be appreciated.

CodePudding user response:

Use a multiprocessing pool to load the images in parallel. My PC reports 16 CPUs. I loaded 100 images with each approach; the timings are shown below.

import multiprocessing
import cv2
import glob
from time import time 

def load_image(image_path):
    return cv2.imread(image_path)

if __name__ == '__main__':
    image_path_list = glob.glob('Huge_dataset/*.png')

    try:
        cpus = multiprocessing.cpu_count()
    except NotImplementedError:
        cpus = 2   # arbitrary default

    # The context manager shuts the worker processes down when the block exits.
    # The timer starts after the pool is created, so only the loading is measured.
    with multiprocessing.Pool(processes=cpus) as pool:
        start = time()
        images = pool.map(load_image, image_path_list)
        print("Total time taken using multiprocessing pool {} seconds".format(time() - start))
    
    images = []
    start = time()
    for image_path in image_path_list:
        images.append(load_image(image_path))
    print("Total time taken using for loop {} seconds".format (time() - start))
    
    
    start = time()
    images = [load_image(image_path) for image_path in image_path_list]        
    print("Total time taken using list comprehension {} seconds".format (time() - start))

Output:

Total time taken using multiprocessing pool 0.2922379970550537 seconds
Total time taken using for loop 1.4935636520385742 seconds
Total time taken using list comprehension 1.4925990104675293 seconds

CodePudding user response:

If you're working with numerical values, it is generally good practice to use NumPy arrays instead of lists. So I suggest changing tem_arr_list into a NumPy array and stacking the images into it, rather than keeping them in a list, which gives you worse performance. Note that stacking only works if all images have the same shape. You can then access your data by indexing the new array.
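The stacking step could look roughly like the sketch below. It is only an illustration: the helper name stack_images is made up for this example, and the input is synthetic uint8 arrays standing in for what cv2.imread would return, on the assumption that every image has already been resized to one common shape.

```python
import numpy as np


def stack_images(images):
    """Stack equally shaped H x W x C arrays into one N x H x W x C array.

    stack_images is a hypothetical helper for this example, not a library function.
    """
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        # np.stack needs identical shapes; resize the images first if they differ
        raise ValueError("images must share one shape, got {}".format(sorted(shapes)))
    return np.stack(images, axis=0)


# Synthetic stand-ins for decoded images: five 4 x 6 BGR images,
# each filled with its own index value.
fake_images = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(5)]

batch = stack_images(fake_images)
print(batch.shape)
```

With real data you would pass the list returned by the loading code instead of fake_images. After stacking, batch[i] is the i-th image and batch[:, :, :, 0] is the first channel of every image. Stacking copies the pixel data once, so it changes how the data is stored and indexed rather than how fast the files are read from disk.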
