I'm working on a problem that involves comparing two lists of dictionaries:
a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]
b = [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]
1.) What's an efficient way to compare the two lists to see how many dictionaries in "a" match dictionaries in "b"? (I might have 1 million dictionaries in each list.)
2.) For a single list, how can I check how many duplicate dictionaries it contains?
CodePudding user response:
Maybe try this. It handles exact matches; for partial matches you would need to modify the dictionary-matching function.
a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]
b = [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]
def modify(data):
    # Count how often each dictionary occurs, keyed by a hashable tuple.
    result = {}
    for i in data:
        # Sort the keys so that key order does not affect the comparison.
        values = []
        for k in sorted(i.keys()):
            values.extend([k, i[k]])
        values = tuple(values)
        if values not in result:
            result[values] = 0
        result[values] += 1
    return result

modified_a = modify(a)
modified_b = modify(b)
# For each dictionary present in both lists, take the smaller occurrence count.
common = sum(min(modified_a[i], modified_b[i]) for i in modified_a if i in modified_b)
print(common)
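The same counting idea can probably be written more compactly with collections.Counter from the standard library. This is only a sketch: freeze is a hypothetical helper name, and it assumes that every value is hashable and that the keys can be sorted against each other (for example, all strings).

```python
from collections import Counter

def freeze(d):
    # Hypothetical helper: build a hashable, order-independent key for a dict.
    # Assumes hashable values and mutually sortable keys.
    return tuple(sorted(d.items()))

a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

count_a = Counter(freeze(d) for d in a)
count_b = Counter(freeze(d) for d in b)

# Counter & Counter keeps, for each key present in both, the smaller count.
common = sum((count_a & count_b).values())
print(common)
```

Each list is scanned once, so the work grows roughly linearly with the number of dictionaries rather than with the product of the two list lengths.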
CodePudding user response:
Python sets are a feasible way to solve this problem. Convert each list of dictionaries into a set of tuples. They have to be tuples because a set cannot hash the dict_items object that items() returns. The items are sorted here so that two dictionaries with the same content but a different key order produce the same tuple.
set_a = {tuple(sorted(dict_.items())) for dict_ in a}
set_b = {tuple(sorted(dict_.items())) for dict_ in b}
To see the dictionaries of a that are in b (each in the form of a tuple of tuples):
set_a.intersection(set_b)
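If you want ordinary dictionaries back instead of tuples of tuples, each tuple of (key, value) pairs can be passed to dict(). A small sketch, with the sets rebuilt here so it runs on its own; sorting the items is my own assumption so that key order does not matter, and matches is a made-up name:

```python
a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

# Sorted items give the same tuple regardless of key insertion order.
set_a = {tuple(sorted(d.items())) for d in a}
set_b = {tuple(sorted(d.items())) for d in b}

# dict() accepts an iterable of (key, value) pairs, so this undoes the conversion.
matches = [dict(t) for t in set_a & set_b]
print(len(matches), matches)
```

Note that this counts distinct dictionaries only; repeated entries collapse into one when they go into a set.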
To check how many duplicates are within one list:
len(a) - len(set_a)
Sets do not store repeated entries, so if a contains any repeated item, the difference will be greater than 0.
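The length difference tells you how many extra copies exist, but not which dictionaries are repeated. One way to get both, sketched with collections.Counter (the names counts, duplicates and extra_copies are made up for this example, and sorting the items is an assumption so that key order is ignored):

```python
from collections import Counter

b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

# Count occurrences of each dictionary via a hashable tuple key.
counts = Counter(tuple(sorted(d.items())) for d in b)

# Dictionaries that occur more than once, with their occurrence counts.
duplicates = {key: n for key, n in counts.items() if n > 1}

# Number of entries that repeat an earlier one; same value as len(b) - len(counts).
extra_copies = sum(n - 1 for n in counts.values())
print(duplicates, extra_copies)
```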
CodePudding user response:
Based on the information given, here's an answer (albeit primitive) that I put together.
a = [
{ "colA": "red", "colB": "red", "colC": 1 },
{ "colA": "grape", "colB": "orange", "colC": 4 },
{ "colA": "tan", "colB": "mustard", "colC": 3 }
]
b = [
{ "colA": "red", "colB": "red", "colC": 1 },
{ "colA": "red", "colB": "red", "colC": 1 },
{ "colA": "red", "colB": "red", "colC": 1, "colD": 3}
]
# Entries of a that also appear in b.
a_to_b_matches: list = []
for entry in a:
    if entry in b:
        a_to_b_matches.append(entry)

# Entries of a that repeat an earlier entry of a.
a_list_dict_duplicates: list = []
a_temp: list = []
for entry in a:
    if entry in a_temp:
        a_list_dict_duplicates.append(entry)
    else:
        a_temp.append(entry)

# Entries of b that repeat an earlier entry of b.
b_list_dict_duplicates: list = []
b_temp: list = []
for entry in b:
    if entry in b_temp:
        b_list_dict_duplicates.append(entry)
    else:
        b_temp.append(entry)
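One caveat for the million-entry case: `entry in b` scans the whole list each time, so the loops above do work proportional to len(a) * len(b). A possible variant keeps the same result but looks entries up in a set instead. This is a sketch under the assumption that all values are hashable; freeze and b_keys are hypothetical names:

```python
def freeze(d):
    # Hypothetical helper: hashable, order-independent key for a dict.
    return tuple(sorted(d.items()))

a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

# Build the lookup set once; membership tests on a set are constant time on average.
b_keys = {freeze(entry) for entry in b}

# Same result as the first loop above, keeping the original dicts and their order.
a_to_b_matches = [entry for entry in a if freeze(entry) in b_keys]
print(a_to_b_matches)
```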
CodePudding user response:
If your data is very large, pandas is a good option:
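import pandas as pd
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
# Compare only the columns that both frames share.
cols = list(set(df_a.columns.values) & set(df_b.columns.values))
# Boolean Series: True where a row of df_a also appears in df_b.
df_a[cols].apply(tuple, axis=1).isin(df_b[cols].apply(tuple, axis=1))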