Reading images with CV2 is too slow

Time:09-17

I have 6000 images of 300×300 pixels, and I have a time problem when I read these images in Python. I need to collect all of the images in a list so that I can use them for my model, so I write a for loop, read each image, and append it to X, as in the code below:

import os
import cv2
from imutils import paths  # paths.list_images walks a directory tree for image files

train_img = sorted(list(paths.list_images("path")))
X = []
y = []
for img in train_img:
    X.append(cv2.imread(img))            # decode the image into a NumPy array
    y.append(img.split(os.path.sep)[6])  # label taken from a path component

But it is very slow! Every time I want to work with this data, I have to spend a lot of time collecting all the images into one list!

So, can you give me some advice or recommendations for my problem? And is there a package that reads images faster than OpenCV?

CodePudding user response:

There is a nice benchmark of different approaches to reading images here. According to it, pyvips and PIL are good options to consider.

For example,

from PIL import Image
import numpy as np
...
# Image.open is lazy; np.asarray forces the decode into an array
im = np.asarray(Image.open(f))
...

Also, as suggested in a comment, it might be useful to consider other formats for storing the images. I guess TIFF or BMP might work out.
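As a sketch of that format-conversion idea (BMP chosen here; the `convert_to_bmp` helper name and the `out_dir` layout are assumptions, not anything from the question): decode each compressed file once with PIL and re-save it uncompressed, so all later reads skip the JPEG/PNG decode step.

```python
from pathlib import Path
from PIL import Image

def convert_to_bmp(image_paths, out_dir="bmp_cache"):
    """One-time conversion: re-save compressed images as uncompressed BMP."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for p in image_paths:
        # PIL picks the output format from the .bmp extension
        Image.open(p).save(out / (Path(p).stem + ".bmp"))
```

You pay the decode cost once during conversion; subsequent runs read the BMPs instead.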

CodePudding user response:

Saving and loading compressed image formats will always cost more time than reading uncompressed formats.

You didn't say if you're using JPEG or PNG, which are compressed, or BMP, which is uncompressed. TIFF can be compressed or uncompressed (it can hold JPEG data).

You should convert your data to an uncompressed format. That will take more disk space. Some specific variants of BMP or TIFF can even be "memory-mapped", and therefore don't require much RAM, regardless of size.
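As a sketch of the memory-mapped idea (using NumPy's raw `.npy` container here, which is an assumption — it stores the bare pixel array, not a pickle): decode every image once into a single uncompressed file, then reopen it with `mmap_mode='r'` so later runs avoid re-decoding 6000 files and pages are read from disk lazily.

```python
import numpy as np

def build_cache(image_paths, cache_path="images.npy"):
    """One-time step: decode all images and stack them into one
    uncompressed .npy file (shape: n_images x H x W x 3, dtype uint8)."""
    import cv2  # assumes images are readable by OpenCV, as in the question
    np.save(cache_path, np.stack([cv2.imread(p) for p in image_paths]))

def load_cache(cache_path="images.npy"):
    # Memory-map the file: near-instant "load", low RAM use;
    # pixel data is pulled from disk only when actually accessed.
    return np.load(cache_path, mmap_mode="r")
```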

I wouldn't recommend "pickling" your data. It's just image data, not arbitrary/general data. It ought to be stored in a typical image file format.

Your choice of libraries:

Also have a look at pyvips, which seems to be a wrapper around libvips. I haven't used it, but another answer pointed it out.

CodePudding user response:

The task is probably I/O-bound. Try concurrent.futures to read the images in parallel/asynchronously. If the files are stored on slow media such as a network share, use a large number of threads (e.g. 32); otherwise use fewer (roughly the CPU count).

import cv2
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=32) as executor:
    # executor.map preserves input order, so X lines up with train_img
    X = list(executor.map(cv2.imread, train_img))

Consider dask or dask-image for more sophisticated use cases.
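Tying the thread-pool read together with the label extraction from the question, a sketch (the `load_dataset` helper is an assumption; in practice you would pass `cv2.imread` as the `reader`, and the label is assumed to be the parent directory name rather than a fixed path index):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def load_dataset(paths, reader, workers=32):
    # read images concurrently; executor.map preserves input order
    with ThreadPoolExecutor(max_workers=workers) as ex:
        X = list(ex.map(reader, paths))
    # label assumed to be the parent directory name of each file
    y = [p.split(os.path.sep)[-2] for p in paths]
    return X, y
```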
