I am trying to iterate through a folder and delete any file that is a duplicate image saved under a different name. After running this script, every file is deleted except one, even though at least a dozen of the roughly 5,000 files are unique. Any help understanding why this happens would be appreciated.
import os
import cv2
directory = r'C:\Users\Grid\scratch'
for filename in os.listdir(directory):
    a=directory+'\\'+filename
    n=(cv2.imread(a))
    q=0
    for filename in os.listdir(directory):
        b=directory+'\\'+filename
        m=(cv2.imread(b))
        comparison = n == m
        equal_arrays = comparison.all()
        if equal_arrays==True and q==1:
            os.remove(b)
        q=1
CodePudding user response:
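There are a few issues with your code, and it is surprising that it runs at all without raising an exception: cv2.imread returns None for a file that has already been deleted or cannot be decoded, and the inner loop also compares every file with itself.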
A few pointers: you only need to read the directory contents once. It would also be much simpler to compute an md5 or sha1 hash of each file while iterating over the directory and remove duplicates along the way. Note that this detects byte-identical files only.
For example:
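import hashlib
import os

directory = r'C:\Users\Grid\scratch'  # folder from the question
hashes = set()  # digests of the files seen so far
for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    # Hash the raw bytes; the with block closes the file before any removal
    with open(path, 'rb') as f:
        digest = hashlib.sha1(f.read()).digest()
    if digest not in hashes:
        hashes.add(digest)
    else:
        os.remove(path)  # same contents as an earlier file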
You can use a stronger hash such as sha256 if you like, but the chance of an accidental sha1 collision among a few thousand files is astronomically low.
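If you would rather check what is going to be removed before deleting anything, the same idea can be wrapped in a function that only reports duplicates. This is a rough sketch, not a drop-in replacement: the name find_duplicates and the algorithm and chunk_size parameters are made up for illustration, and it assumes that "duplicate" means byte-identical file contents. It reads each file in chunks, skips subdirectories, and defaults to sha256 as the stronger hash.

```python
import hashlib
import os


def find_duplicates(directory, algorithm="sha256", chunk_size=1 << 20):
    """Return paths whose contents match an earlier file in the directory.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    seen = {}        # digest -> first path seen with that digest
    duplicates = []  # later paths whose digest was already seen
    # Sort so that the file kept for each digest is predictable
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        hasher = hashlib.new(algorithm)
        with open(path, "rb") as f:
            # Read in chunks so large files are not loaded into memory at once
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hasher.update(chunk)
        digest = hasher.digest()
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates
```

Print the returned list first, and once it looks right, pass each path to os.remove. Nothing is deleted by the function itself, so a mistake in the directory path costs nothing.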