I have a HUGE list of objects.

That looks like this

output = [
   {
     'name': 'some name',
     'id': '1'
   }
]

clean_list = []
for i in range(len(output)):
    if output[i] not in output[i   1:]:
        clean_list.append(output[i])

It's 1 million, and this is the method I'm currently using.. however it takes hours to complete this operation when my array is massive.

Is there a optimal way to remove duplicates?

CodePudding user response：

There's two issues with the code you propose.

1. speed of "in" test

You wrote

clean_list = []

but you really want

clean_list = set()

That way in completes in O(1) constant time, rather than O(N) linear time.

The linear probe inside the for gives the loop O(N^2) quadratic cost.

2. equality test of hashable items

Your source items look like e.g. dict(name='some name', id=1). You want to turn them into hashable tuples, e.g. ('some name', 1), so you can take advantage of set's O(1) fast lookups.

from pprint import pp

clean = set()
for d in output:
    t = tuple(d.values())
    if t not in clean:
        clean.add(t)


pp(sorted(clean))

But wait! No need for checks, since a set will reject attempts to add same thing twice. So this suffices:

clean = set()
for d in output:
    clean.add(tuple(d.values()))

And now it's simple enough that a set comprehension makes sense.

clean = {tuple(d.values())
         for d in output}

Consider uniquifying on just name, if the id value doesn't matter to you.

tl;dr: Doing ~ 10^6 operations is way better than doing ~ 10^12 of them.

CodePudding user response：

for such a large dataset it makes sence to use pandas, something like this:

import pandas as pd

output = [
   {'name': 'some name 1','id': '1'},
   {'name': 'some name 2','id': '2'},
   {'name': 'some name 3','id': '3'},
   {'name': 'some name 2','id': '2'}]

clean_list = pd.DataFrame(output).drop_duplicates().to_dict('records')

>>> clean_list
'''
[{'name': 'some name 1', 'id': '1'},
 {'name': 'some name 2', 'id': '2'},
 {'name': 'some name 3', 'id': '3'}]