not able to remove duplicate image with hashing


My aim is to remove identical images like the following:

Image 1: https://i.stack.imgur.com/8dLPo.png

Image 2: https://i.stack.imgur.com/hF11m.png

Currently, I am using average hashing with

  • a hash size of 32 (smaller hash sizes give collisions; see the sketch after this list)
  • a threshold of 10-20
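
A minimal sketch of this setup, assuming the Python imagehash library and Pillow (the file names are placeholders for the example images):

from PIL import Image
import imagehash

# Placeholder file names for the two example images.
img1 = Image.open("8dLPo.png")
img2 = Image.open("hF11m.png")

# Average hash on a 32x32 grid, as described above.
h1 = imagehash.average_hash(img1, hash_size=32)
h2 = imagehash.average_hash(img2, hash_size=32)

# Subtracting two ImageHash objects gives their Hamming distance (bits that differ).
if h1 - h2 <= 20:  # threshold in the 10-20 range
    print("treated as duplicates")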

I tried pHash as well, but it also removes images that are merely similar, like the following (which I don't want):

Image 3: https://i.stack.imgur.com/CwZ09.png

Image 4: https://i.stack.imgur.com/HvAaJ.png

So I am looking for some technique through which I can identify that

  • Image 1 and Image 2 are identical
  • Image 3 and Image 4 are distinct

Kindly help; I have been stuck on this problem for a long time.

Note: The type/kind of images will be different every time, so I can't invest the time to learn deep learning and give it a try.

CodePudding user response:

I tried pHash as well, but it also removes images that are merely similar, like the following (which I don't want)

The problem you are facing arises because most perceptual-hashing libraries convert the image to a grayscale representation, then resize it, and finally compute a hash from that reduced image. Another variant computes the average gradient of the pixels and hashes that result instead.

This is done to save computational resources and make the hashing more efficient, but for pictures that are VERY similar you end up with the same hash.
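
To illustrate the point, here is a rough, hand-rolled sketch of that grayscale-and-resize step (not any particular library's internals; the file names are the example images from above). Two visually near-identical pictures collapse to almost the same bit pattern once they are shrunk to a tiny grid:

from PIL import Image
import numpy as np

def tiny_average_hash(path, size=8):
    # Grayscale, shrink to a tiny grid, then threshold each pixel against
    # the mean - fine detail is discarded at this step.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

bits1 = tiny_average_hash("CwZ09.png")
bits2 = tiny_average_hash("HvAaJ.png")

# Very similar pictures tend to produce (almost) the same bit pattern.
print("differing bits:", int(np.count_nonzero(bits1 != bits2)))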

You could solve this issue by comparing the exact (byte-level) hashes of the files:

import hashlib

# Read both files as raw bytes.
with open("HvAaJ.png", "rb") as image1:
    data1 = image1.read()
with open("CwZ09.png", "rb") as image2:
    data2 = image2.read()

# SHA-256 over the raw bytes: only byte-for-byte identical files match.
hash1 = hashlib.sha256(data1).hexdigest()
hash2 = hashlib.sha256(data2).hexdigest()

if hash1 == hash2:
    print("Files are identical")  # e.g. remove one of the two files here

I tested the method with your pictures above and the results were exactly as you wanted. The two example pictures do not have the same hash, whereas comparing "HvAaJ" with a copy of "HvAaJ" yields the same hash!

There is this great StackOverflow thread which tackles the exact same approach.

Now, your task mentions that "Every time type/kind of images [is different]", which becomes a headache once different file types are involved. This is where you need to decide:

a.) Go with method 1 (as described above) and accept the risk that a picture which is visually identical but stored in a different encoding (which likely won't happen often) gets flagged as different, or

b.) Actually consider the "content" of the image and compare an approximation of it, e.g. via grayscaling and resizing and then comparing the resulting average hash. In that case you might end up mistaking SIMILAR-looking pictures for identical ones.
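
A quick way to see the trade-off concretely (a hedged sketch; the name of the re-encoded copy is made up): the same picture saved in a different format gets a different SHA-256, while its perceptual hash normally stays the same.

import hashlib
from PIL import Image
import imagehash

original = "HvAaJ.png"
reencoded = "HvAaJ_copy.bmp"  # hypothetical re-encoded copy of the same picture

# Save the same picture again in another lossless format.
Image.open(original).convert("RGB").save(reencoded)

# Option a): byte-level hashes differ, because the encodings differ.
sha_a = hashlib.sha256(open(original, "rb").read()).hexdigest()
sha_b = hashlib.sha256(open(reencoded, "rb").read()).hexdigest()
print("byte hashes equal:", sha_a == sha_b)

# Option b): the perceptual hash only looks at the pixel content,
# so a lossless re-encode normally keeps the same hash.
ahash_a = imagehash.average_hash(Image.open(original).convert("RGB"))
ahash_b = imagehash.average_hash(Image.open(reencoded))
print("perceptual hashes equal:", ahash_a == ahash_b)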

The second option (using the imagehash library) might look like this:

from PIL import Image
import imagehash
import numpy

image1 = Image.open("HvAaJ.png")
image2 = Image.open("CwZ09.png")

# Average hash: grayscale, shrink, then compare each pixel to the mean.
hash1 = imagehash.average_hash(image1)
hash2 = imagehash.average_hash(image2)

# The underlying hash is a boolean array; compare it bit by bit.
if numpy.array_equal(hash1.hash, hash2.hash):
    print("Images look identical")  # e.g. remove one of the two files here

The two images you supplied did not produce the same hash, which solves the issue you presented.
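
If exact equality of the perceptual hashes turns out to be too strict (or too loose) for some image sets, the imagehash library also lets you compare hashes by Hamming distance, so a threshold can be tuned instead. A small sketch with the same example files (the threshold value is illustrative):

from PIL import Image
import imagehash

hash1 = imagehash.average_hash(Image.open("HvAaJ.png"))
hash2 = imagehash.average_hash(Image.open("CwZ09.png"))

# Subtraction gives the Hamming distance: the number of bits that differ.
distance = hash1 - hash2

if distance <= 2:  # illustrative threshold; tune it per image set
    print("treating the images as duplicates")
else:
    print("treating the images as distinct")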

CodePudding user response:

So I can't even invest time to learn deep learning and give it a try.

Something that might be worth trying is a similar-image search utility using Locality Sensitive Hashing (LSH) and random projection on top of image representations computed by a pretrained image classifier. This kind of search engine is also known as a near-duplicate (or near-dup) image detector.

You need to load a pretrained model and cut out its embedding block.

You can download one like this:

!wget -q https://git.io/JuMq0 -O flower_model_bit_0.96875.zip
!unzip -qq flower_model_bit_0.96875.zip

A different model might be better suited (e.g. one pretrained on ImageNet or similar).

You then use the output of the embedding model

bit_model = tf.keras.models.load_model("flower_model_bit_0.96875")

# Keep only the pretrained feature-extraction block and wrap it so that
# raw images go in and normalized embedding vectors come out.
embedding_model = tf.keras.Sequential(
    [
        tf.keras.layers.Input((IMAGE_SIZE, IMAGE_SIZE, 3)),
        tf.keras.layers.Rescaling(scale=1.0 / 255),
        bit_model.layers[1],
        tf.keras.layers.Normalization(mean=0, variance=1),
    ],
    name="embedding_model",
)

and feed its output into the hashing functions below, much as you would with an ordinary image hash:

def hash_func(embedding, random_vectors):
    embedding = np.array(embedding)

    # Random projection: the sign of each projection becomes one bit.
    bools = np.dot(embedding, random_vectors) > 0
    return [bool2int(bool_vec) for bool_vec in bools]


def bool2int(x):
    # Pack a boolean vector into a single integer (bit i is set if x[i] is True).
    y = 0
    for i, j in enumerate(x):
        if j:
            y += 1 << i
    return y
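
A rough sketch of how these pieces might fit together, reusing embedding_model and hash_func from above. IMAGE_SIZE, HASH_SIZE, EMBEDDING_DIM, and the file name are placeholder assumptions; the tutorial wraps this logic in a proper hash table/index:

import numpy as np
from PIL import Image

IMAGE_SIZE = 224       # assumed input size of the embedding model
HASH_SIZE = 8          # number of bits per LSH hash
EMBEDDING_DIM = 2048   # assumed length of the embedding vector

# One fixed set of random projection vectors, shared by all images.
random_vectors = np.random.randn(EMBEDDING_DIM, HASH_SIZE)

# Load an image, embed it, then reduce the embedding to a compact LSH hash.
img = Image.open("HvAaJ.png").convert("RGB").resize((IMAGE_SIZE, IMAGE_SIZE))
batch = np.expand_dims(np.asarray(img, dtype="float32"), axis=0)
embedding = embedding_model.predict(batch)

# Near-duplicate images give embeddings pointing in almost the same direction,
# so they land in the same (or a nearby) hash bucket.
print(hash_func(embedding, random_vectors))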

The full tutorial is described here.
