I have a nested list containing 514,000 sub-lists.
I need to find the duplicated sub-lists and store their positions.
I wrote some code, but it is very inefficient: it compares every pair, so roughly 514000 * 514000 operations.
Is there a solution with better efficiency?
My nested list:
[["src_ip","src_port","dst_ip","dst_port"],
 ["src_ip","src_port","dst_ip","dst_port"],
 ...]
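My current approach is essentially a pairwise comparison like this (a simplified sketch, not my exact code):

result = []
for i in range(len(lst)):
    for j in range(i + 1, len(lst)):
        if lst[i] == lst[j]:    # compares every pair: ~514000 * 514000 checks
            result.append(j)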
CodePudding user response:
A much faster solution is to use a set to track the items already seen and to append the position of each duplicate to a list. Because lists are not hashable, each sub-list is first converted to a tuple. Here is the resulting code:
seen = set()   # Unique items
result = []    # List of duplicate positions
for pos, item in enumerate(lst):
    tmp = tuple(item)
    if tmp in seen:
        result.append(pos)
    else:
        seen.add(tmp)
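For example, with a small made-up input (the IP and port values here are just for illustration):

lst = [["1.1.1.1", "80", "2.2.2.2", "443"],
       ["3.3.3.3", "22", "4.4.4.4", "22"],
       ["1.1.1.1", "80", "2.2.2.2", "443"]]

After running the loop above, result is [2], because the sub-list at index 2 duplicates the one first seen at index 0.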
If you also want to know where the first (unique) occurrence of each duplicate is, just change seen to a dict and replace seen.add(tmp) with seen[tmp] = pos.
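Here is a sketch of that variant; storing (duplicate position, first position) pairs in a duplicates list is one possible output format, an assumption rather than part of the original answer:

seen = {}         # Maps each tuple to the position of its first occurrence
duplicates = []   # (duplicate_position, first_occurrence_position) pairs
for pos, item in enumerate(lst):
    tmp = tuple(item)
    if tmp in seen:
        duplicates.append((pos, seen[tmp]))
    else:
        seen[tmp] = pos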
The resulting solution runs in linear time (i.e. O(n), where n is the size of the input lst). This code takes about 250 ms on my machine with a list containing 514,000 sub-lists of 4 short random strings.