I have a list of dictionaries with the keys street
, number
and some_flag
.
My goal is to search the dicts for duplicates in the keys street
and number
. If for two or more dicts these two key/value pairs are identical, I want to assign the value 1 to their some_flag
key.
Please see reproducible example below.
Starting list of dictionaries:
a = [
{'street': 'ocean drive', 'number': '1', 'some_flag': 0},
{'street': 'ocean drive', 'number': '3', 'some_flag': 0},
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'apple tree rd.', 'number': '3', 'some_flag': 0},
]
Expected output:
a_checked = [
{'street': 'ocean drive', 'number': '1', 'some_flag': 0},
{'street': 'ocean drive', 'number': '3', 'some_flag': 0},
{'street': 'ocean drive', 'number': '4', 'some_flag': 1}, # duplicate street / number keys
{'street': 'ocean drive', 'number': '4', 'some_flag': 1}, # duplicate street / number keys
{'street': 'apple tree rd.', 'number': '3', 'some_flag': 0},
]
My best effort:
The code I've got so far is derived from Aarons answer (here) and the community wiki's answer (here)
from collections import defaultdict, Counter
items = defaultdict(list) # create defaultdict
for row in a:
items[row['street']].append(row['number']) # make a list of 'number' values for each 'street' key
for key in items.keys():
if checkIfDuplicates(items[key]): #if there is more than one 'number' --> function definition see below
duplicate_dict = {}
duplicate_dict['numbers'] = [item for item, count in Counter(items[key]).items() if count > 1] # storing duplicate numbers in dict
duplicate_dict['street'] = key # storing street name in same dict
Function to check if given list contains any duplicates (from here):
def checkIfDuplicates(listOfElems):
if len(listOfElems) == len(set(listOfElems)):
return False
else:
return True
current output:
print(duplicate_dict)
{'numbers': ['4'], 'street': 'ocean drive'}
With my approach, I would now have to match the duplicate_dict
with the original list a
, which doesn't seem very efficient.
Are there more direct ways to solve this problem?
CodePudding user response:
Simply can be done by:
import pandas as pd
import numpy as np
# Your code
a = [
{'street': 'ocean drive', 'number': '1', 'some_flag': 0},
{'street': 'ocean drive', 'number': '3', 'some_flag': 0},
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'apple tree rd.', 'number': '3', 'some_flag': 0},
]
# My code
data = pd.DataFrame(a)
data["some_flag"]= np.where(data.duplicated(keep=False), 1, data["some_flag"])
data
Output
street | number | some_flag | |
---|---|---|---|
0 | ocean drive | 1 | 0 |
1 | ocean drive | 3 | 0 |
2 | ocean drive | 4 | 1 |
3 | ocean drive | 4 | 1 |
4 | apple tree rd. | 3 | 0 |
If you are interested in having a dictionary instead of the dataframe, you can try changing the dataframe to dictionary using to_dict
function:
data.to_dict(orient="list")
which results in :
{'number': ['1', '3', '4', '4', '3'],
'some_flag': [0, 0, 1, 1, 0],
'street': ['ocean drive',
'ocean drive',
'ocean drive',
'ocean drive',
'apple tree rd.']}
Explanation
Using duplicated
function on a dataframe and assigning False
to the keep
parameter, you can find the duplicated rows. Then using where
which is a numpy function, you can assign a value to an array based on a logical statement.
CodePudding user response:
You could use dict.setdefault
to first store a dict of lists (where the keys are "street" and "number") and then iterate over the values of this dictionary to check if multiple dicts have the same "street" and "number" and modify "some_flag" of those that are multiple:
tmp = {}
for d in a:
tmp.setdefault((d['street'], d['number']), []).append(d)
out = []
for v in tmp.values():
if len(v) > 1:
for d in v:
d['some_flag'] = 1
out.extend(v)
Output:
[{'street': 'ocean drive', 'number': '1', 'some_flag': 0},
{'street': 'ocean drive', 'number': '3', 'some_flag': 0},
{'street': 'ocean drive', 'number': '4', 'some_flag': 1},
{'street': 'ocean drive', 'number': '4', 'some_flag': 1},
{'street': 'apple tree rd.', 'number': '3', 'some_flag': 0}]
CodePudding user response:
If the list of dictionaries is not sorted, then we'd have to check every possible pair of elements for equality, which would take O(n^2)
time.
But if the list is sorted, then we could check each element with its next element, if they weren't equal, then there is no point checking that element with the following elements. On the other hand, if they were equal, we continue checking with the next one, and so on.
def is_duplicate(d_1, d_2):
return d_1['street'] == d_2['street'] and d_1['number'] == d_2['number']
def set_duplicate(d_1, d_2):
d_1['some_flag'] = 1
d_2['some_flag'] = 1
a = sorted(a, key=lambda k: (k['street'].lower(), k['number']))
cur_index = 0
while cur_index < len(a):
next_index = cur_index 1
while next_index < len(a) and is_duplicate(a[cur_index], a[next_index]):
set_duplicate(a[cur_index], a[next_index])
next_index = 1
cur_index = 1
print(a)
CodePudding user response:
You can also achieve this without pandas with a little more code e.g. like this:
import copy
a = [
{'street': 'ocean drive', 'number': '1', 'some_flag': 0},
{'street': 'ocean drive', 'number': '3', 'some_flag': 0},
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'ocean drive', 'number': '4', 'some_flag': 0}, # duplicate street / number keys
{'street': 'apple tree rd.', 'number': '3', 'some_flag': 0},
]
result = copy.deepcopy(a) #duplicate a to not override your input
last_entry = ""
for entry in result:
if last_entry == "": #skip first iteration to have two entries to compare
last_entry = entry
continue
if last_entry["street"] == entry["street"] and last_entry["number"] == entry["number"]:
last_entry["some_flag"] = 1
entry["some_flag"] = 1
last_entry = entry
print(result)
CodePudding user response:
du = {}
for d in a:
new_key = d['street'] "_" d['number']
if new_key in du.keys():
du[new_key] = du[new_key] 1
else:
du[new_key] = 1
print(du)
# {'ocean drive_1': 1, 'ocean drive_3': 1, 'ocean drive_4': 2, 'apple tree rd._3': 1}
for k, v in du.items():
if v > 1:
k1 = k.split("_")[0]
k2 = k.split("_")[1]
for d in a:
if d['street'] == k1 and d['number'] == k2:
d['some_flag'] = 1
print(a)
# [{'street': 'ocean drive', 'number': '1', 'some_flag': 0}, {'street': 'ocean drive', 'number': '3', 'some_flag': 0}, {'street': 'ocean drive', 'number': '4', 'some_flag': 1}, {'street': 'ocean drive', 'number': '4', 'some_flag': 1}, {'street': 'apple tree rd.', 'number': '3', 'some_flag': 0}]