Python: Iterate through a big array of BigInts, find the first duplicate, and print out the indexes of the duplicates


For a cryptography course at my university I need to compare a lot of SHA hashes which are saved in an array. I need to compare the values at the array indexes.

There are duplicates in the array - I already checked that by comparing the length of a set of the array with the length of the array itself.

Now I need the indexes of the duplicated values. I found a lot of solutions for checking for duplicates, but only for short arrays. My array has a length of 3 million, and the values at each index are around this size: 864205495604807476120572616017955259175325408501.

I have written a nested loop (coming from Java and trying to learn Python). Here is my code:

counter_outer = 0
while counter_outer < len(hash_value_array):
    counter_inner = counter_outer + 1
    while counter_inner < len(hash_value_array):
        if hash_value_array[counter_outer] == hash_value_array[counter_inner]:
            print(f"*****FOUND MATCH *****")
            print(f"Message [{counter_outer}] Hashvalue has same Value as Message [{counter_inner}]")
            safe_index1 = counter_outer
            safe_index2 = counter_inner
            counter_outer = len(hash_value_array)
            break
        else:
            print("------NO Match-----")
        counter_inner += 1
    counter_outer += 1

As you can imagine ... this takes ages.

What's important for me is that I need the indexes where the duplicates are - not the values. So, for example, if there is an 898 at index 100 and an 898 at index 1000001, I only need as output: 100, 1000001

Any suggestions?

CodePudding user response:

You can do something along these lines in Python:

Assume this list of 5 signatures (they could be ints or strings, but I have strings):

li = ['864205495604807476120572616017955259175325408501',
      '864205495604807476120572616017955259175325408502',
      '864205495604807476120572616017955259175325408503',
      '864205495604807476120572616017955259175325408501',
      '864205495604807476120572616017955259175325408502']

You can make a dict of lists with each list being the index of duplicates:

idx = {}
for i, sig in enumerate(li):
    idx.setdefault(sig, []).append(i)  # map each signature to every index where it occurs
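
(As a side note, collections.defaultdict from the standard library builds the same mapping, if you prefer that style - just a sketch of the identical idea:)

from collections import defaultdict

idx = defaultdict(list)
for i, sig in enumerate(li):
    idx[sig].append(i)  # same result: signature -> list of indexes where it occurs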

You can then find the duplicates like so:

for sig, v in idx.items():
    if len(v) > 1:  # more than one index means this signature is duplicated
        print(f'{sig}: {v}')

Prints:

864205495604807476120572616017955259175325408501: [0, 3]
864205495604807476120572616017955259175325408502: [1, 4]

If you make li 3,000,000 entries, that runs in about 550 milliseconds on my computer and likely would be similar on yours.
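
Since you said you only need the first duplicate, you can also stop as soon as a repeated value shows up, instead of building the whole dict first. A minimal sketch, assuming hash_value_array is your list:

seen = {}
for i, sig in enumerate(hash_value_array):
    if sig in seen:
        # seen[sig] is the first index where this value appeared
        print(f"Message [{seen[sig]}] Hashvalue has same Value as Message [{i}]")
        break
    seen[sig] = i

This keeps the single pass over the list but can exit early, so on average it does even less work than scanning all 3 million entries.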


But to be honest - I don't understand why this is so much faster.

Yours is super slow because it has O(n**2) complexity from the nested while loops: you loop over the remaining array for every element, which for 3 million entries works out to roughly n*(n-1)/2 ≈ 4.5 trillion comparisons. The method I showed you here loops over the entire list only once, doing one average-O(1) dict lookup per element -- about 3 million operations, not trillions!
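
If you want to verify the difference yourself, here is a rough timing harness for the dict approach on synthetic data - randomly generated 160-bit integers standing in for your hashes, with one duplicate planted at the end:

import random
import time

n = 3_000_000
li = [random.getrandbits(160) for _ in range(n)]  # fake SHA-1-sized values
li[-1] = li[0]                                    # plant one duplicate

start = time.perf_counter()
idx = {}
for i, sig in enumerate(li):
    idx.setdefault(sig, []).append(i)
dupes = {sig: v for sig, v in idx.items() if len(v) > 1}
elapsed = time.perf_counter() - start

print(dupes)               # e.g. {<value>: [0, 2999999]}
print(f"{elapsed:.3f} s")  # times the dict pass only, not the data generation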
