How do I compare two lists of dictionaries?

I'm working on a problem that involves comparing two lists of dictionaries:

a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]  
b =  [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]

1.) What's an efficient way to compare the two lists to see how many dictionaries in "a" match dictionaries in "b"? (I might have 1 million dictionaries in the list.)

2.) For a single list, how can I check how many duplicate dictionaries it contains?

CodePudding user response：

Maybe try this. It handles exact matches only; for partial matches you would need to modify the function that turns each dictionary into a key.

a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]  
b =  [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]

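def modify(data):
    """Count how many times each dictionary occurs in data."""
    result = {}
    for i in data:
        # Sort the keys so insertion order does not affect the comparison.
        key = sorted(i.keys())
        values = []
        for k in key:
            values.extend([k, i[k]])
        # Tuples are hashable, so they can be used as dictionary keys.
        values = tuple(values)
        if values not in result:
            result[values] = 0
        # Increment the occurrence count for this dictionary.
        result[values] += 1
    return result


modified_a = modify(a)
modified_b = modify(b)

# For each dictionary found in both lists, count the smaller number of occurrences.
common = sum(min(modified_a[i], modified_b[i]) for i in modified_a if i in modified_b)
print(common)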
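
The same counting idea can be written more compactly with collections.Counter from the standard library. This is only a minimal sketch, assuming every dictionary value is hashable; the helper names dict_key and count_common are made up for illustration.

```python
from collections import Counter


def dict_key(d):
    """Build a hashable, order-independent key for a dict (hypothetical helper).

    Assumes every value in d is hashable.
    """
    return frozenset(d.items())


def count_common(list_a, list_b):
    """Count exact matches between two lists, respecting duplicates."""
    counts_a = Counter(dict_key(d) for d in list_a)
    counts_b = Counter(dict_key(d) for d in list_b)
    # "&" keeps, for each key, the smaller of the two counts.
    return sum((counts_a & counts_b).values())


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

print(count_common(a, b))  # 1
```

Building each Counter is a single pass over its list, so the work grows roughly linearly with the list sizes.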

CodePudding user response：

Python sets are a feasible way to solve this problem. Convert each list of dictionaries into a set of tuples. Tuples are needed because neither a dict nor the dict_items view returned by items() is hashable, so neither can be stored in a set. Sorting the items first makes the result independent of the order in which keys were inserted. This assumes every value in the dictionaries is hashable.

set_a = {tuple(sorted(dict_.items())) for dict_ in a}
set_b = {tuple(sorted(dict_.items())) for dict_ in b}

To see the dictionaries of a that are in b (dictionaries in the form of a tuple of tuples):

set_a.intersection(set_b)

To check how many duplicates are within one list:

len(a) - len(set_a)

Sets do not store repeated entries, so if a contains any repeated item, the difference will be greater than 0.
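
One detail worth checking with this approach is key order: dictionaries compare equal regardless of insertion order, but tuples built directly from items() do not. The short sketch below illustrates the difference and shows two order-independent keys. The dictionaries d1 and d2 are made-up examples, and all values are assumed to be hashable.

```python
d1 = {"colA": "red", "colC": 1}
d2 = {"colC": 1, "colA": "red"}  # same content, different insertion order

print(d1 == d2)  # True
print(tuple(d1.items()) == tuple(d2.items()))  # False

# Sorting the items, or using a frozenset, removes the order dependence.
print(tuple(sorted(d1.items())) == tuple(sorted(d2.items())))  # True
print(frozenset(d1.items()) == frozenset(d2.items()))  # True

# Duplicate count for a small list, using order-independent keys.
rows = [d1, d2, {"colA": "tan", "colC": 3}]
unique = {frozenset(d.items()) for d in rows}
print(len(rows) - len(unique))  # 1
```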

CodePudding user response：

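Based on the information given, here's an answer (albeit primitive) that I put together. Each "in" test scans a list from the start, so the matching loop takes roughly len(a) * len(b) comparisons and will be slow for a million entries.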

a = [
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "grape", "colB": "orange", "colC": 4 },
        { "colA": "tan", "colB": "mustard", "colC": 3 }
    ]  
b =  [
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "red", "colB": "red", "colC": 1, "colD": 3}
    ]

# Entries of a that also appear in b.
a_to_b_matches: list = []
for entry in a:
    if entry in b:
        a_to_b_matches.append(entry)

# Entries of a that repeat an earlier entry of a.
a_list_dict_duplicates: list = []
a_temp: list = []
for entry in a:
    if entry in a_temp:
        a_list_dict_duplicates.append(entry)
    else:
        a_temp.append(entry)

# Entries of b that repeat an earlier entry of b.
b_list_dict_duplicates: list = []
b_temp: list = []
for entry in b:
    if entry in b_temp:
        b_list_dict_duplicates.append(entry)
    else:
        b_temp.append(entry)
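
For large lists, the same loops can be sketched with set lookups in place of list scans, so each membership test takes roughly constant time. This is a hedged variant, not the answer's original method: find_matches and find_duplicates are hypothetical names, and the sketch assumes every dictionary value is hashable.

```python
def find_matches(list_a, list_b):
    """Return the entries of list_a that also occur in list_b."""
    keys_b = {frozenset(d.items()) for d in list_b}
    return [d for d in list_a if frozenset(d.items()) in keys_b]


def find_duplicates(entries):
    """Return every entry that repeats an earlier entry in the same list."""
    seen = set()
    duplicates = []
    for d in entries:
        key = frozenset(d.items())
        if key in seen:
            duplicates.append(d)
        else:
            seen.add(key)
    return duplicates


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

print(len(find_matches(a, b)))   # 1
print(len(find_duplicates(a)))   # 0
print(len(find_duplicates(b)))   # 1
```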

CodePudding user response：

If your data is very large, pandas is a good option:

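import pandas as pd
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
# Compare only on the columns that both frames share.
cols = list(set(df_a.columns.values) & set(df_b.columns.values))
# True for each row of a that also occurs in b (restricted to cols).
matches = df_a[cols].apply(tuple, axis=1).isin(df_b[cols].apply(tuple, axis=1))
print(matches.sum())  # number of matching rows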
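
The shared-column comparison above is a form of partial matching, and the idea can also be sketched in plain Python for readers without pandas. The names all_keys and match_on_shared_columns are hypothetical. The sketch assumes hashable values and fills missing keys with None, whereas pandas would use NaN.

```python
def all_keys(rows):
    """Collect every key that appears in at least one dictionary."""
    keys = set()
    for d in rows:
        keys.update(d)
    return keys


def match_on_shared_columns(list_a, list_b):
    """Flag each entry of list_a that matches some entry of list_b
    when both are restricted to the keys the two lists have in common."""
    cols = sorted(all_keys(list_a) & all_keys(list_b))

    def project(d):
        # Missing keys become None (an assumption of this sketch).
        return tuple(d.get(c) for c in cols)

    rows_b = {project(d) for d in list_b}
    return [project(d) in rows_b for d in list_a]


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

flags = match_on_shared_columns(a, b)
print(flags)       # [True, False, False]
print(sum(flags))  # 1
```

Because "colD" appears only in b, it is ignored here, so the third dictionary in b counts as a match for the first dictionary in a.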