Problem:
I've got around 10,000 images to compare to each other. My current program compares around 60 images every second, but at that speed it would take nearly 9 days of runtime to finish. I've tried rewriting it in C, but that version took nearly 3x as long as the Python one.
Question:
Is there any faster or more efficient way to compare images? I'm fine with using other languages and other libraries.
Code:
from PIL import Image
from PIL import ImageChops
import math, operator
from functools import reduce
import os

def rmsdiff(image_1, image_2):
    # Root-mean-square difference: histogram the pixel-wise diff, then weight
    # each bin count by its squared bin value (i % 256 maps RGB bins to values).
    h = ImageChops.difference(image_1, image_2).histogram()
    return math.sqrt(
        reduce(operator.add, map(lambda count, i: count * (i % 256) ** 2, h, range(len(h))))
        / (float(image_1.size[0]) * image_1.size[1])
    )

current = 0
try:
    dire = "C:\\Users\\Nikola\\Downloads\\photos"
    photos = os.listdir(dire)

    # Find the index to resume from (falls back to the first file).
    start = 0
    for idx, val in enumerate(photos):
        if val == "":
            start = idx
            break

    for photo_1 in range(start, len(photos)):
        # Skip directory entries without a file extension.
        if "." not in photos[photo_1]:
            continue
        print(f'Image: {photos[photo_1]}')
        with Image.open(dire + "\\" + photos[photo_1]) as image_1:
            image_1 = image_1.resize((16, 16))
            for photo_2 in range(photo_1 + 1, len(photos)):
                current = photos[photo_2]  # remembered for the interrupt report
                try:
                    # Only compare files with a 3- or 4-letter extension.
                    if photos[photo_2][-4] != "." and photos[photo_2][-5] != ".":
                        continue
                except IndexError:
                    continue
                with Image.open(dire + "\\" + photos[photo_2]) as image_2:
                    image_2 = image_2.resize((16, 16))
                    try:
                        value = rmsdiff(image_1, image_2)
                        if value < 12:
                            print(f'Similar Image: {photos[photo_2]}')
                            continue
                    except Exception:
                        pass
except KeyboardInterrupt:
    # Report the file being processed when interrupted, to allow resuming.
    print()
    print(current)
CodePudding user response:
Your problem is a bit unusual, though. Having to read the pixel data itself in order to compare is not something that should be necessary most of the time; it would usually make more sense to have some metadata to compare by.
That said, here are some very different approaches to speeding this up.
- You could parallelize the code for a speedup roughly proportional to your number of cores (see the multiprocessing sketch after this list).
- You could replace RMSE with a simple abs(diff) or another cheaper distance function, saving a lot of computation.
- You could write your own diff function that stops accumulating once it passes some difference threshold. For this early exit to pay off, the function needs to be compiled, or at least JIT-compiled (see the Numba sketch below).
- If you could pre-calculate some dimensionality reduction for each image, you could perform the comparison in a lower dimension. For example, sum the rows of each image to get a column of sums, and compare those columns instead of the entire images; then run the full comparison only on images whose low-dimensional representations came out similar (sketched below).
- If a lot of your images are identical, you could group them; after comparing any image against one member of a group, you don't have to repeat the comparison for the rest of that group.
- A quick speedup can probably be obtained by using cdist correctly, calculating all-to-all distances in one vectorized call (see the first sketch after this list).
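Here is a minimal sketch of the cdist approach, assuming NumPy and SciPy are available; load_thumbnails is a hypothetical helper that flattens each 16x16 RGB thumbnail into one row of a matrix, and the threshold is a placeholder to tune. Note that for 10,000 images the full distance matrix is on the order of 800 MB, so you may want to process it in row blocks:

import os
import numpy as np
from PIL import Image
from scipy.spatial.distance import cdist

def load_thumbnails(dire, size=(16, 16)):
    # Hypothetical helper: one flattened float32 row per readable image.
    names, rows = [], []
    for name in os.listdir(dire):
        if "." not in name:
            continue
        try:
            with Image.open(os.path.join(dire, name)) as im:
                rows.append(np.asarray(im.convert("RGB").resize(size), dtype=np.float32).ravel())
                names.append(name)
        except OSError:
            pass  # skip unreadable files
    return names, np.stack(rows)

names, X = load_thumbnails(r"C:\Users\Nikola\Downloads\photos")

# One vectorized call computes every pairwise distance; 'cityblock' is the
# cheap abs(diff) metric from the second bullet, 'euclidean' is closer to RMS.
D = cdist(X, X, metric="cityblock")

thresh = 3000  # placeholder threshold; tune it for your metric and data
ii, jj = np.where(np.triu(D < thresh, k=1))
for a, b in zip(ii, jj):
    print(names[a], "~", names[b])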
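And a sketch of the dimensionality-reduction prefilter from the fourth bullet, reusing X and names from the sketch above; each 16x16x3 thumbnail collapses to a 48-value signature by summing its rows, and only pairs that survive the cheap pass get the full comparison (both thresholds are placeholders):

import numpy as np
from scipy.spatial.distance import cdist

# Collapse each (16, 16, 3) thumbnail to 16 row-sums per channel: 768 -> 48.
signatures = X.reshape(-1, 16, 16, 3).sum(axis=2).reshape(len(X), -1)

# Cheap pass on the 48-value signatures...
S = cdist(signatures, signatures, metric="cityblock")
sig_thresh = 3000  # must be loose enough not to drop true matches
candidates = np.argwhere(np.triu(S < sig_thresh, k=1))

# ...then the expensive full comparison only on the surviving pairs.
for a, b in candidates:
    if np.abs(X[a] - X[b]).sum() < 3000:
        print(names[a], "~", names[b])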
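For the early-exit idea from the third bullet, a plain Python loop would be slower than NumPy, so it needs compilation; here is a minimal sketch assuming Numba is installed (early_exit_diff is a hypothetical name):

import numpy as np
from numba import njit

@njit(cache=True)
def early_exit_diff(a, b, threshold):
    # Accumulate abs(diff) element by element and bail out as soon as the
    # running total proves the pair cannot be similar.
    total = 0.0
    for k in range(a.shape[0]):
        total += abs(a[k] - b[k])
        if total > threshold:
            return -1.0  # sentinel: too different, comparison cut short
    return total

# Usage with two flattened thumbnails from the cdist sketch:
# d = early_exit_diff(X[0], X[1], 3000.0)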
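Finally, a sketch of the parallelization from the first bullet, splitting the outer loop across cores with multiprocessing; compare_row is a hypothetical worker, and in a real run you would share X with the workers (e.g. via a pool initializer) rather than pickling it into every task:

import numpy as np
from multiprocessing import Pool

def compare_row(args):
    # Compare image i against every later image, vectorized as abs(diff).
    i, X, thresh = args
    d = np.abs(X[i + 1:] - X[i]).sum(axis=1)
    return [(i, i + 1 + int(j)) for j in np.where(d < thresh)[0]]

if __name__ == "__main__":
    names, X = load_thumbnails(r"C:\Users\Nikola\Downloads\photos")
    with Pool() as pool:
        rows = pool.map(compare_row, [(i, X, 3000) for i in range(len(X))])
    pairs = [pair for row in rows for pair in row]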
CodePudding user response:
Following on from my comments, I suspect that loading and resizing take most of the time, so that is where I would aim to optimise.
I don't have a Python interpreter available at the moment to test properly, but along these lines:
from functools import lru_cache

@lru_cache(maxsize=None)
def loadImage(filename):
    # Each file is opened and resized at most once; repeat calls are cache hits.
    im = Image.open(filename)
    im = im.resize((16, 16))
    return im
That should already make a massive difference. Then take advantage of JPEG "draft" mode, which lets the decoder produce a reduced-size image far more cheaply than a full decode followed by a resize, something like:
im = Image.open(filename)
# Ask the decoder for a cheap, low-resolution draft before the final resize.
im.draft('RGB', (32, 32))
im = im.resize((16, 16))
return im
You could also multithread the loading if your laptop has a decent CPU.
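For example, a minimal sketch with concurrent.futures, reusing the cached loadImage above (the directory path is the one from the question); Pillow releases the GIL for much of the decode work, so threads genuinely overlap the loading:

import os
from concurrent.futures import ThreadPoolExecutor

dire = r"C:\Users\Nikola\Downloads\photos"
filenames = [os.path.join(dire, f) for f in os.listdir(dire) if "." in f]

with ThreadPoolExecutor() as pool:
    images = list(pool.map(loadImage, filenames))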