How to efficiently build a list of dicts without duplicates?-CodePudding

In the code below, the 3rd and 4th elements are considered the same, because 'start' and 'end' are just switched:

{'start': '222',  'end': '333', 'type':'c'},
{'start': '333',  'end': '222', 'type':'c'}

I need to build a relations list or set which don't contain duplicates like above. Supposed the input is from list_of_dicts, and my code is the following to achieve the purpose:

relations = []
list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]

duplicate_keys = set()
for my_dict in list_of_dicts:
    duplicate_key = ''.join(sorted(my_dict['start']   my_dict['end']   my_dict['type']))
    if duplicate_key not in duplicate_keys:
        relations.append(my_dict)
        duplicate_keys.add(duplicate_key)

print(relations)

This seems to work. My list_of_dicts are supposed to be large, for example, 100 millions. Is this the fast way to do it? Also, the list_of_dicts here are illustrative purpose for convenience, but the 'relations' list are built from similar input.

CodePudding user response：

I think that is better to transform those dict into a specialize class, add to that class a hash function and let a set or a dict or similar take care of duplicates

>>> class MyObject:
        def __init__(self,start,end,type):
            self.data = (*sorted((start,end)),type)
        def __hash__(self):
            return hash(self.data)
        def __repr__(self):
            return f"{self.__class__.__name__}{self.data}"
        def __eq__(self,other):
            if isinstance(other,self.__class__):
                return self.data == other.data
            return False

    
>>> list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]
>>> new=[MyObject(**x) for x in list_of_dicts]
>>> new
[MyObject('123', '456', 'a'), MyObject('111', '122', 'b'), MyObject('222', '333', 'c'), MyObject('222', '333', 'c')]
>>> set(new)
{MyObject('123', '456', 'a'), MyObject('222', '333', 'c'), MyObject('111', '122', 'b')}
>>>

CodePudding user response：

Something like this would be quite efficient for looping through a list of dicts and determining which are dupes (add anything to a set that is considered "seen"). Would require looping through each dict once, but constant time lookup to see if a key exists.

seen = set()

for my_dict in list_of_dicts:
    if (my_dict['start'], my_dict['end']) in seen:
        # dupe
        continue
    # not dupe
    seen.add((my_dict['start'], my_dict['end']))
    seen.add((my_dict['end'], my_dict['start']))

CodePudding user response：

To generate a sorting key that's order-independent only for specific, named keys (as initially defined by the sortable_keys default parameter):

def order_independent_key(input_dict, sortable_keys=('start', 'end')):
    local_dict = dict(input_dict)
    sortable_vals = []
    for key in sortable_keys:
        sortable_vals.append(local_dict.pop(key))
    return tuple(sortable_vals)   tuple(sorted(local_dict.items()))

Thus:

list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]
deduplicated_dicts = {}
for item in list_of_dicts:
    deduplicated_dicts[order_independent_key(item)] = item

...will generate a structure like:

{('111', '122', ('type', 'b')): {'end': '122', 'start': '111', 'type': 'b'},
 ('123', '456', ('type', 'a')): {'end': '456', 'start': '123', 'type': 'a'},
 ('222', '333', ('type', 'c')): {'end': '333', 'start': '222', 'type': 'c'},
 ('333', '222', ('type', 'c')): {'end': '222', 'start': '333', 'type': 'c'}}

...for which doing a lookup by order_independent_key(another_dict) is an operation with performance characteristics that scale only with the number of keys in another_dict.

CodePudding user response：

Try to use dict and tuples for keys:

list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]

for my_dict in list_of_dicts:
    k = tuple(sorted(my_dict.values()))

    if k not in relations:
        relations[k] = my_dict


print(list(relations.values()))

And one-liner with a little-bit different behavior:

list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]

relations = list({tuple(sorted(my_dict.values())): my_dict for my_dict in list_of_dicts}.values())

print(relations)

In this one-liner, latest value will be used in the dict. This is equivalent to the full version, but without condition if k not in relations:.

Result:

[{'start': '123', 'end': '456', 'type': 'a'}, {'start': '111', 'end': '122', 'type': 'b'}, {'start': '333', 'end': '222', 'type': 'c'}]