Merging list of dictionaries to remove all duplicates

Time:10-15

I'm trying to write some simple Python code that merges a list of dictionaries into a condensed list, as I have lots of duplicates at the moment.

From this:

[
    {
        "module": "RECEIPT BISCUITS",
        "product_range": "ULKER BISCUITS",
        "receipt_category": "BISCUITS"
    },
    {
        "module": "RECEIPT BISCUITS",
        "product_range": "ULKER",
        "receipt_category": "BISCUITS"
    },
    {
        "module": "RECEIPT BISCUITS",
        "product_range": "ULKER BISCUITS GOLD",
        "receipt_category": "BISCUITS GOLD"
    },
    {
        "module": "RECEIPT COFFEE",
        "product_range": "BLACK GOLD",
        "receipt_category": "BLACK GOLD"
    }
]

To this:

[
    {
        "module": "RECEIPT BISCUITS",
        "product_range": ["ULKER BISCUITS", "ULKER"],
        "receipt_category": ["BISCUITS", "BISCUITS GOLD"]
    },
    {
        "module": "RECEIPT COFFEE",
        "product_range": ["BLACK GOLD"],
        "receipt_category": ["BLACK GOLD"]
    }
]

The module value is used to group the entries, and the other two fields should be stored as lists even if there's only one value. This is JSON format, by the way.

CodePudding user response:

collections.defaultdict to the rescue for your data regrouping needs!

import collections

data = [
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER BISCUITS", "receipt_category": "BISCUITS"},
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER", "receipt_category": "BISCUITS"},
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER BISCUITS GOLD", "receipt_category": "BISCUITS GOLD"},
    {"module": "RECEIPT COFFEE", "product_range": "BLACK GOLD", "receipt_category": "BLACK GOLD"},
]

grouped = collections.defaultdict(lambda: collections.defaultdict(list))
group_key = "module"

for datum in data:
    datum = datum.copy()  # Copy so we can .pop without consequence
    group = datum.pop(group_key)  # Get the key (`module` value)
    for key, value in datum.items():  # Loop over the rest and put them in the group
        grouped[group][key].append(value)

collated = [
    {
        group_key: group,
        **values,
    }
    for (group, values) in grouped.items()
]

print(collated)

This prints out:

[
  {'module': 'RECEIPT BISCUITS', 'product_range': ['ULKER BISCUITS', 'ULKER', 'ULKER BISCUITS GOLD'], 'receipt_category': ['BISCUITS', 'BISCUITS', 'BISCUITS GOLD']},
  {'module': 'RECEIPT COFFEE', 'product_range': ['BLACK GOLD'], 'receipt_category': ['BLACK GOLD']}
]

Note that this doesn't deduplicate the values within each list (receipt_category contains 'BISCUITS' twice), since I wasn't sure whether the order of the values is important to you, and so whether to use sets (which do not retain order).
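
If the order does matter, one possible approach is to use an inner dict as an ordered set, since dicts keep insertion order (guaranteed from Python 3.7). The sketch below is only an illustration under that assumption; the function name collate_unique is made up here and is not part of the answer's code.

```python
import collections

data = [
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER BISCUITS", "receipt_category": "BISCUITS"},
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER", "receipt_category": "BISCUITS"},
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER BISCUITS GOLD", "receipt_category": "BISCUITS GOLD"},
    {"module": "RECEIPT COFFEE", "product_range": "BLACK GOLD", "receipt_category": "BLACK GOLD"},
]


def collate_unique(records, group_key="module"):
    """Group records by group_key, keeping each value once in first-seen order.

    Hypothetical helper: the inner dict is used as an ordered set
    (its keys are the values seen so far, its values are ignored).
    """
    grouped = collections.defaultdict(lambda: collections.defaultdict(dict))
    for record in records:
        group = record[group_key]
        for key, value in record.items():
            if key != group_key:
                grouped[group][key][value] = None  # repeated values overwrite themselves
    return [
        {group_key: group, **{key: list(seen) for key, seen in fields.items()}}
        for group, fields in grouped.items()
    ]


collated = collate_unique(data)
print(collated)
```

With the sample data, 'BISCUITS' then appears only once in receipt_category, and the values stay in the order they were first seen.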

Changing defaultdict(list) to defaultdict(set) and .append(value) to .add(value) will make the values unique.
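
Here is a rough sketch of that set-based variant. One thing to be aware of: json.dumps cannot serialize sets, so this version converts each set to a sorted list at the end. Sorting is my assumption to get a stable order, and the name collate_with_sets is hypothetical.

```python
import collections
import json

data = [
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER BISCUITS", "receipt_category": "BISCUITS"},
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER", "receipt_category": "BISCUITS"},
    {"module": "RECEIPT BISCUITS", "product_range": "ULKER BISCUITS GOLD", "receipt_category": "BISCUITS GOLD"},
    {"module": "RECEIPT COFFEE", "product_range": "BLACK GOLD", "receipt_category": "BLACK GOLD"},
]


def collate_with_sets(records, group_key="module"):
    """Group records by group_key, collecting the other fields into sets."""
    grouped = collections.defaultdict(lambda: collections.defaultdict(set))
    for record in records:
        group = record[group_key]
        for key, value in record.items():
            if key != group_key:
                grouped[group][key].add(value)  # sets silently drop duplicates
    # Sets are not JSON serializable, so turn each one into a sorted list.
    return [
        {group_key: group, **{key: sorted(values) for key, values in fields.items()}}
        for group, fields in grouped.items()
    ]


collated_sets = collate_with_sets(data)
print(json.dumps(collated_sets, indent=2))
```

The original insertion order of the values is lost here; they come out alphabetically because of sorted().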
