for Cryptographie course in my university i need to compare a lot of SHA Hashes which are saved in an array. I need to compare the values of the Array Indexes.
There are duplicates in the Array - i already checked that with a comparison between the length of a set of the array and the length of the array itself.
Now i need to have the indexes of the duplicated values. I found a lot of solutions for checking for duplicates, but only for short arrays. My Array has a length of 3 Million and the Values in each index are around this length: 864205495604807476120572616017955259175325408501.
I have written a nested loop (coming from Java and trying to learn python). Here my code:
counter_outer = 0
while counter_outer < len(hash_value_array):
counter_inner = counter_outer 1
while counter_inner < len(hash_value_array):
if hash_value_array[counter_outer] == hash_value_array[counter_inner]:
print(f"*****FOUND MATCH *****")
print(f"Message [{counter_outer}] Hashvalue has same Value as Message [{counter_inner}]")
safe_index1 = counter_outer
safe_index2 = counter_inner
counter_outer = len(hash_value_array)
break
else:
print("------NO Match-----")
counter_inner = 1
counter_outer = 1
As you can imagine ... this takes ages.
Important for me is, that i need the indexes where the duplicates are - not the values. So for example, if there is a 898 in index 100 and a 898 in index 1000001 i only need as output: 100, 1000001
Any suggestions?
CodePudding user response:
You can do something along these lines in Python:
Assume this list of 5 signatures (they could be ints or strings, but I have strings):
li=['864205495604807476120572616017955259175325408501',
'864205495604807476120572616017955259175325408502',
'864205495604807476120572616017955259175325408503',
'864205495604807476120572616017955259175325408501',
'864205495604807476120572616017955259175325408502']
You can make a dict of lists with each list being the index of duplicates:
idx={}
for i, sig in enumerate(li):
idx.setdefault(sig, []).append(i)
You can then find the duplicates like so:
for sig, v in idx.items():
if len(v)>1:
print(f'{sig}: {v}')
Prints:
864205495604807476120572616017955259175325408501: [0, 3]
864205495604807476120572616017955259175325408502: [1, 4]
If you make li
3,000,000 entries, that runs in about 550 milliseconds on my computer and likely would be similar on yours.
But to be honest - i dont understand why this is so much faster.
Yours is super slow because it has On**2 complexity from nested while
loops. You are looping over the entire array for each element. The method I showed you here is only looping once over the entire list -- not 3 million times!