I have a HUGE list of objects that looks like this:
output = [
    {
        'name': 'some name',
        'id': '1'
    }
]
clean_list = []
for i in range(len(output)):
    if output[i] not in output[i+1:]:
        clean_list.append(output[i])
The list has around 1 million entries, and this is the method I'm currently using; however, it takes hours to complete the operation when the array is that massive.
Is there an optimal way to remove duplicates?
CodePudding user response:
There are two issues with the code you propose.
1. speed of the "in" test
You wrote
clean_list = []
but you really want
clean_list = set()
That way the in test completes in O(1) constant time, rather than O(N) linear time.
The linear probe inside the for loop gives the whole loop O(N^2) quadratic cost.
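A quick way to see that difference (a minimal sketch with arbitrary sizes, not a benchmark of your actual data):

import timeit

items = list(range(100_000))
as_list = items
as_set = set(items)

# O(N) per lookup: the list is scanned element by element.
print(timeit.timeit(lambda: 99_999 in as_list, number=100))

# O(1) per lookup: the set hashes the value and checks one bucket.
print(timeit.timeit(lambda: 99_999 in as_set, number=100))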
2. equality test of hashable items
Your source items look like e.g. dict(name='some name', id=1).
You want to turn them into hashable tuples, e.g. ('some name', 1),
so you can take advantage of set's O(1) fast lookups.
from pprint import pp

clean = set()
for d in output:
    t = tuple(d.values())
    if t not in clean:
        clean.add(t)

pp(sorted(clean))
But wait! There's no need for the membership check, since a set will simply reject attempts to add the same thing twice.
So this suffices:
clean = set()
for d in output:
    clean.add(tuple(d.values()))
And now it's simple enough that a set comprehension makes sense.
clean = {tuple(d.values()) for d in output}
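If you need dicts again afterwards, you can rebuild them from the tuples. This is a sketch that assumes every dict has the same keys in the same order, as in your example:

# Assumption: all source dicts share the keys ('name', 'id') in this order.
keys = ('name', 'id')
clean_dicts = [dict(zip(keys, t)) for t in clean]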
Consider uniquifying on just name, if the id value doesn't matter to you; a sketch of that follows.
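A minimal sketch of that idea, keeping the first dict seen for each name (the variable names here are just for illustration):

# Keep the first dict seen for each name; later duplicates are skipped.
seen_names = set()
unique_by_name = []
for d in output:
    if d['name'] not in seen_names:
        seen_names.add(d['name'])
        unique_by_name.append(d)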
tl;dr: Doing ~ 10^6 operations is way better than doing ~ 10^12 of them.
CodePudding user response:
For such a large dataset it makes sense to use pandas, something like this:
import pandas as pd

output = [
    {'name': 'some name 1', 'id': '1'},
    {'name': 'some name 2', 'id': '2'},
    {'name': 'some name 3', 'id': '3'},
    {'name': 'some name 2', 'id': '2'}]

clean_list = pd.DataFrame(output).drop_duplicates().to_dict('records')
>>> clean_list
[{'name': 'some name 1', 'id': '1'},
 {'name': 'some name 2', 'id': '2'},
 {'name': 'some name 3', 'id': '3'}]
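If, as the first answer suggests, you only care about uniqueness of name, drop_duplicates takes a subset parameter; a sketch against the same example data:

# Deduplicate on the 'name' column only; the first occurrence is kept by default.
clean_by_name = (
    pd.DataFrame(output)
      .drop_duplicates(subset=['name'])
      .to_dict('records')
)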