In the code below, the 3rd and 4th elements are considered the same, because 'start' and 'end' are just switched:
{'start': '222', 'end': '333', 'type':'c'},
{'start': '333', 'end': '222', 'type':'c'}
I need to build a relations list or set which don't contain duplicates like above. Supposed the input is from list_of_dicts, and my code is the following to achieve the purpose:
relations = []
list_of_dicts = [{'start': '123', 'end': '456', 'type':'a'},
{'start': '111', 'end': '122', 'type':'b'},
{'start': '222', 'end': '333', 'type':'c'},
{'start': '333', 'end': '222', 'type':'c'},
]
duplicate_keys = set()
for my_dict in list_of_dicts:
duplicate_key = ''.join(sorted(my_dict['start'] my_dict['end'] my_dict['type']))
if duplicate_key not in duplicate_keys:
relations.append(my_dict)
duplicate_keys.add(duplicate_key)
print(relations)
This seems to work. My list_of_dicts are supposed to be large, for example, 100 millions. Is this the fast way to do it? Also, the list_of_dicts here are illustrative purpose for convenience, but the 'relations' list are built from similar input.
CodePudding user response:
I think that is better to transform those dict into a specialize class, add to that class a hash function and let a set
or a dict
or similar take care of duplicates
>>> class MyObject:
def __init__(self,start,end,type):
self.data = (*sorted((start,end)),type)
def __hash__(self):
return hash(self.data)
def __repr__(self):
return f"{self.__class__.__name__}{self.data}"
def __eq__(self,other):
if isinstance(other,self.__class__):
return self.data == other.data
return False
>>> list_of_dicts = [{'start': '123', 'end': '456', 'type':'a'},
{'start': '111', 'end': '122', 'type':'b'},
{'start': '222', 'end': '333', 'type':'c'},
{'start': '333', 'end': '222', 'type':'c'},
]
>>> new=[MyObject(**x) for x in list_of_dicts]
>>> new
[MyObject('123', '456', 'a'), MyObject('111', '122', 'b'), MyObject('222', '333', 'c'), MyObject('222', '333', 'c')]
>>> set(new)
{MyObject('123', '456', 'a'), MyObject('222', '333', 'c'), MyObject('111', '122', 'b')}
>>>
CodePudding user response:
Something like this would be quite efficient for looping through a list of dicts and determining which are dupes (add anything to a set that is considered "seen"). Would require looping through each dict once, but constant time lookup to see if a key exists.
seen = set()
for my_dict in list_of_dicts:
if (my_dict['start'], my_dict['end']) in seen:
# dupe
continue
# not dupe
seen.add((my_dict['start'], my_dict['end']))
seen.add((my_dict['end'], my_dict['start']))
CodePudding user response:
To generate a sorting key that's order-independent only for specific, named keys (as initially defined by the sortable_keys
default parameter):
def order_independent_key(input_dict, sortable_keys=('start', 'end')):
local_dict = dict(input_dict)
sortable_vals = []
for key in sortable_keys:
sortable_vals.append(local_dict.pop(key))
return tuple(sortable_vals) tuple(sorted(local_dict.items()))
Thus:
list_of_dicts = [{'start': '123', 'end': '456', 'type':'a'},
{'start': '111', 'end': '122', 'type':'b'},
{'start': '222', 'end': '333', 'type':'c'},
{'start': '333', 'end': '222', 'type':'c'},
]
deduplicated_dicts = {}
for item in list_of_dicts:
deduplicated_dicts[order_independent_key(item)] = item
...will generate a structure like:
{('111', '122', ('type', 'b')): {'end': '122', 'start': '111', 'type': 'b'},
('123', '456', ('type', 'a')): {'end': '456', 'start': '123', 'type': 'a'},
('222', '333', ('type', 'c')): {'end': '333', 'start': '222', 'type': 'c'},
('333', '222', ('type', 'c')): {'end': '222', 'start': '333', 'type': 'c'}}
...for which doing a lookup by order_independent_key(another_dict)
is an operation with performance characteristics that scale only with the number of keys in another_dict
.
CodePudding user response:
Try to use dict and tuples for keys:
list_of_dicts = [{'start': '123', 'end': '456', 'type':'a'},
{'start': '111', 'end': '122', 'type':'b'},
{'start': '222', 'end': '333', 'type':'c'},
{'start': '333', 'end': '222', 'type':'c'},
]
for my_dict in list_of_dicts:
k = tuple(sorted(my_dict.values()))
if k not in relations:
relations[k] = my_dict
print(list(relations.values()))
And one-liner with a little-bit different behavior:
list_of_dicts = [{'start': '123', 'end': '456', 'type':'a'},
{'start': '111', 'end': '122', 'type':'b'},
{'start': '222', 'end': '333', 'type':'c'},
{'start': '333', 'end': '222', 'type':'c'},
]
relations = list({tuple(sorted(my_dict.values())): my_dict for my_dict in list_of_dicts}.values())
print(relations)
In this one-liner, latest value will be used in the dict.
This is equivalent to the full version, but without condition if k not in relations:
.
Result:
[{'start': '123', 'end': '456', 'type': 'a'}, {'start': '111', 'end': '122', 'type': 'b'}, {'start': '333', 'end': '222', 'type': 'c'}]