Removing all duplicate images with different filenames from a directory

I am trying to iterate through a folder and delete any file that is a duplicate image stored under a different name. After running this script, every file is deleted except one, even though at least a dozen of the roughly 5,000 images are unique. Any help understanding why this is happening would be appreciated.

import os
import cv2 

directory = r'C:\Users\Grid\scratch'
 
for filename in os.listdir(directory):
    a=directory+'\\'+filename
    n=(cv2.imread(a))
    q=0
    for filename in os.listdir(directory):
        b=directory+'\\'+filename
        m=(cv2.imread(b))
        comparison = n == m
        equal_arrays = comparison.all()
        if equal_arrays==True and q==1:
            os.remove(b)
        q=1

CodePudding user response:

There are a few issues with your code, and it's surprising that it runs at all without throwing an exception.
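One way to see why only a single file survives: the `q` flag skips just the first entry of the inner loop, not the file currently held in `n`, so every later file is eventually compared with itself, matches, and is removed. The sketch below replays the question's loops on an in-memory dict instead of a real folder. The names `simulate` and `simulate_fixed` and the sample file names are hypothetical, and the fixed variant is only one possible correction (skip the file being compared and anything already removed).

```python
def simulate(files):
    """Replay the question's nested loops on a dict of name -> content."""
    files = dict(files)                  # work on a copy
    for outer in list(files):            # outer listing is taken once
        n = files.get(outer)             # None if the file was already removed
        q = 0
        for inner in list(files):        # inner listing is re-read every pass
            m = files[inner]
            if n == m and q == 1:        # also true when inner is outer itself
                del files[inner]
            q = 1
    return sorted(files)


def simulate_fixed(files):
    """Same idea, but never compare a file with itself."""
    files = dict(files)
    for outer in list(files):
        if outer not in files:           # already removed as a duplicate
            continue
        for inner in list(files):
            if inner != outer and files[inner] == files[outer]:
                del files[inner]
    return sorted(files)


# Three unique "images": the original logic still removes all but the first.
print(simulate({"a.png": b"1", "b.png": b"2", "c.png": b"3"}))        # ['a.png']
# One real duplicate: only b.png is removed.
print(simulate_fixed({"a.png": b"1", "b.png": b"1", "c.png": b"2"}))  # ['a.png', 'c.png']
```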

A few pointers: you only need to list the directory contents once. It would also be much simpler to compute an MD5 or SHA-1 hash of each file while iterating over the directory and remove duplicates along the way.

For example:

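import hashlib
import os

directory = r'C:\Users\Grid\scratch'  # same folder as in the question

hashes = set()  # digests of the files seen so far

for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    # Hash the raw bytes; byte-identical files produce the same digest.
    with open(path, 'rb') as f:
        digest = hashlib.sha1(f.read()).digest()
    if digest not in hashes:
        hashes.add(digest)  # first file with this content: keep it
    else:
        os.remove(path)  # content already seen: delete the duplicate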

You can use a more secure hash such as SHA-256 if you like, but the chance of an accidental SHA-1 collision between two different files is astronomically low.
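If you do want a stronger hash, a minimal sketch might look like the following. It assumes SHA-256 via `hashlib.new`, reads each file in chunks so large images are not loaded whole, and defaults to a dry run that only reports what would be removed. The function names `file_digest` and `remove_duplicate_files` and the `dry_run` parameter are hypothetical, not part of the answer above. Note that hashing file bytes only finds byte-for-byte copies; the same picture saved with different encoding settings would not match.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file's bytes in chunks of chunk_size bytes."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def remove_duplicate_files(directory, algorithm="sha256", dry_run=True):
    """Keep the first file seen per digest; return the paths of the others.

    With dry_run=True (the default) nothing is deleted.
    """
    seen = {}          # digest -> path of the file that is kept
    duplicates = []    # paths whose content was already seen
    for name in sorted(os.listdir(directory)):   # sorted for a stable "first"
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue                             # skip subdirectories
        digest = file_digest(path, algorithm)
        if digest in seen:
            duplicates.append(path)
            if not dry_run:
                os.remove(path)
        else:
            seen[digest] = path
    return duplicates
```

Calling `remove_duplicate_files(directory)` first and inspecting the returned list, then calling it again with `dry_run=False`, gives a chance to check the result before anything is deleted.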