Removing duplicates from the list of pydantic objects

Time:10-26

I am trying to remove duplicates from a list of pydantic objects. The usual approaches fail with a TypeError, and the only method that works is very slow.

Is there a faster way to remove duplicates than my method?

Code:

Pydantic model (a.py)

from pydantic import BaseModel


class Photo(BaseModel):
    title: str
    url: str

Main file (b.py)

from collections import OrderedDict
from a import Photo

# 3 objects; a_obj and c_obj are duplicates of each other
a_obj = {
    'title': 'SOME TITLE v1',
    'url': 'http://some.url'
}
b_obj = {
    'title': 'SOME TITLE v2',
    'url': 'http://different.url'
}
c_obj = {
    'title': 'SOME TITLE v1',
    'url': 'http://some.url'
}

# Build the list of pydantic objects
pd_obj_list = list()
pd_obj_list += [Photo(**a_obj)]
pd_obj_list += [Photo(**b_obj)]
pd_obj_list += [Photo(**c_obj)]

# My attempts to remove duplicates

# Using OrderedDict.fromkeys
final_list_0 = list(OrderedDict.fromkeys(pd_obj_list))
# raises TypeError: unhashable type: 'Photo'

# Using set
final_list_1 = list(set(pd_obj_list))
# raises TypeError: unhashable type: 'Photo'

# Using enumerate
final_list_2 = [i for n, i in enumerate(pd_obj_list) if i not in pd_obj_list[:n]]
# Works, but every membership test scans the preceding slice (quadratic time),
# so it is too slow when I have ~10k objects in the list

Answer:

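Key an OrderedDict on the fields that define a duplicate, here (title, url), and take its values. Tuples of strings are hashable, so this runs in linear time:

from collections import OrderedDict

pd_obj_list = [Photo(**a_obj), Photo(**b_obj), Photo(**c_obj)]
# Each (title, url) key keeps the position of its first occurrence
final_list_0 = list(OrderedDict(((photo.title, photo.url), photo) for photo in pd_obj_list).values())
print(final_list_0)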

Output

[Photo(title='SOME TITLE v1', url='http://some.url'), Photo(title='SOME TITLE v2', url='http://different.url')]
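
The same keyed-deduplication idea can be sketched without OrderedDict, using an explicit set of keys already seen. This is only an illustration: `unique_by` is a hypothetical helper name, and `FakePhoto` is a stand-in for the pydantic `Photo` model so that the snippet runs without any third-party dependency.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def unique_by(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Return items in their original order, keeping the first item seen per key."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:  # average O(1) lookup instead of scanning a list slice
            seen.add(k)
            result.append(item)
    return result


class FakePhoto:
    """Stand-in for the pydantic Photo model (hypothetical, illustration only)."""

    def __init__(self, title: str, url: str) -> None:
        self.title = title
        self.url = url


photos = [
    FakePhoto("SOME TITLE v1", "http://some.url"),
    FakePhoto("SOME TITLE v2", "http://different.url"),
    FakePhoto("SOME TITLE v1", "http://some.url"),
]

# The key is a tuple of the fields that define "duplicate"
deduped = unique_by(photos, key=lambda p: (p.title, p.url))
print([(p.title, p.url) for p in deduped])
```

Because the key function is passed in, the same helper should work for the real `Photo` objects with `key=lambda p: (p.title, p.url)`, and the objects themselves never need to be hashable.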

If Photo is immutable, you could instead define __hash__ as follows, which makes the model hashable so that the set and OrderedDict.fromkeys approaches work:

from collections import OrderedDict

from pydantic import BaseModel


class Photo(BaseModel):
    title: str
    url: str

    def __hash__(self):
        return hash((self.title, self.url))


# 3 objects; a_obj and c_obj are duplicates of each other
a_obj = {
    'title': 'SOME TITLE v1',
    'url': 'http://some.url'
}
b_obj = {
    'title': 'SOME TITLE v2',
    'url': 'http://different.url'
}
c_obj = {
    'title': 'SOME TITLE v1',
    'url': 'http://some.url'
}

pd_obj_list = [Photo(**a_obj), Photo(**b_obj), Photo(**c_obj)]
final_list_0 = list(OrderedDict.fromkeys(pd_obj_list))
print(final_list_0)

Output

[Photo(title='SOME TITLE v1', url='http://some.url'), Photo(title='SOME TITLE v2', url='http://different.url')]
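
A hand-written __hash__ is only safe if the fields never change after the object is created. One possible way to enforce that is to declare the model as frozen, in which case pydantic generates __hash__ itself. The sketch below assumes pydantic 1.8 or later, where `frozen=True` can be passed as a class keyword argument; `FrozenPhoto` is a hypothetical name used to avoid clashing with the `Photo` class above.

```python
from pydantic import BaseModel


# Assumption: pydantic >= 1.8, which accepts config options as class kwargs.
# A frozen model rejects attribute assignment and gets a generated __hash__.
class FrozenPhoto(BaseModel, frozen=True):
    title: str
    url: str


photos = [
    FrozenPhoto(title="SOME TITLE v1", url="http://some.url"),
    FrozenPhoto(title="SOME TITLE v2", url="http://different.url"),
    FrozenPhoto(title="SOME TITLE v1", url="http://some.url"),
]

# dict preserves insertion order (Python 3.7+), so a plain dict is enough here
unique_photos = list(dict.fromkeys(photos))
print(unique_photos)
```

With a frozen model the hash cannot drift out of sync with the field values, which is the main risk of defining __hash__ on a mutable model.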