Get list of unique dictionaries while accumulating the count of their attributes


duplicate_array = [
    {'id': 1, 'name': 'john', 'count': 1},
    {'id': 1, 'name': 'john', 'count': 2},
    {'id': 2, 'name': 'peter', 'count': 1},
]

How can I obtain a list of unique dictionaries, removing the duplicates while accumulating the 'count' of the duplicates?

[
    {'id': 1, 'name': 'john', 'count': 3},   # here is the main use case: I want the total count
    {'id': 2, 'name': 'peter', 'count': 1},
]

I tried this to get the unique values, but I'm not sure how to accumulate the counts:

final = list({v['id']:v for v in duplicate_array}.values())

CodePudding user response:

Here is some code that does not use any Python library. This does, however, make the code longer.

duplicate_array = [
    {'id': 1, 'name': 'john', 'count': 1},
    {'id': 1, 'name': 'john', 'count': 2},
    {'id': 2, 'name': 'peter', 'count': 1},
]
final = []

for i, x in enumerate(duplicate_array):
    count = 0

    # Sum and remove every entry with the same id and name as x.
    for d in duplicate_array.copy():
        if d != 0 and d["id"] == x["id"] and d["name"] == x["name"]:
            count += d["count"]
            duplicate_array.remove(d)

    # Insert a placeholder zero so enumerate's index stays valid.
    duplicate_array.insert(i, 0)
    x["count"] = count
    final.append(x)

In the first block of code, we define the original list and initialise our output list.

Then we have the for loop.

First, we initialise count to 0. Then we loop through the list again to find all dictionaries with the same id and name as the current one. For each match, we add its count value to count and remove it from the list. We also check that the entry is nonzero, because we insert zeros into the array later on; this check prevents the program from crashing.

We insert a zero at the current position in the list to prevent Python from skipping the next item. A for loop in Python keeps an internal counter for the item it is at. When we delete the current item (which we did in the nested for loop), that counter no longer points at the correct item, because all following items shift one position to the left. By inserting a zero into the original list, we shift the items back and make the index correct again.
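To see why, here is a tiny standalone illustration of that index shift (not part of the answer's code):

nums = [1, 2, 3]
for i, n in enumerate(nums):
    nums.remove(n)  # removing while iterating shifts later items left
print(nums)  # [2] -- the 2 was never visited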

Finally, we set the original dictionary's count to the value we just calculated and we append the unique dictionary to our final list.

After this code, duplicate_array will be filled with zeros. If you don't want this, you can copy the list with duplicate_array.copy() first.
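Keep in mind that .copy() is a shallow copy: it keeps the original list free of zeros, but the loop still updates the inner dictionaries' count values in place. A minimal sketch using copy.deepcopy instead, if the inner dictionaries must stay untouched too (the original name here is just for illustration):

import copy

original = copy.deepcopy(duplicate_array)  # fully independent copy of the input
# ...run the loop above on duplicate_array...
print(original)  # still the three original dictionaries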

CodePudding user response:

So if you are OK with using Pandas, this could work:

import pandas as pd

results = (
    pd.DataFrame(duplicate_array)
    .groupby(["id", "name"])
    .agg("sum")
    .reset_index()
    .to_dict(orient="records")
)

The disadvantage is that you are using a rather large library, but I think that the readability is great this way.
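With the duplicate_array from the question, this produces the desired result (a quick check; requires pandas to be installed):

print(results)
# [{'id': 1, 'name': 'john', 'count': 3}, {'id': 2, 'name': 'peter', 'count': 1}]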

CodePudding user response:

It is probably best to encapsulate this in a function of its own:

def dedupe(dt: list) -> list:
    dx = dict()
    for item in dt:
        key_id = (item.get('id'), item.get('name'))  # We assume an (id, name) pair is a unique identity
        current = dx.get(key_id, {
            'id': item.get('id'),
            'name': item.get('name'),
        })  # get() lets us provide a default value if the key doesn't exist yet
        current['count'] = current.get('count', 0) + item.get('count', 0)  # add this item's count to the running total
        dx[key_id] = current  # Update the result dictionary

    return list(dx.values())  # Convert back to a list

duplicate_array = [
    {'id': 1, 'name': 'john', 'count': 1},
    {'id': 1, 'name': 'john', 'count': 2},
    {'id': 2, 'name': 'peter', 'count': 1},
]
result = dedupe(duplicate_array)
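Running this on the sample input gives the expected result (dictionaries preserve insertion order in Python 3.7+, so john comes first):

>>> dedupe(duplicate_array)
[{'id': 1, 'name': 'john', 'count': 3}, {'id': 2, 'name': 'peter', 'count': 1}]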

This leverages several common python features:

  • Tuples that are hashable can be used as keys in a dictionary.
  • We can use get to provide a default value, in this case an 'initialization' value. The first time we see a unique key (it does not exist already in our dictionary), we provide this value. Then we add the count in from our duplicated array.
  • Because we are using a dictionary to accumulate results, we can leverage the unique keys to deduplicate the array. All it takes at the end is to take the values of the dictionary as our new array.

Note that key_id can simply be the dictionary's id, rather than the combination of id and name. This completes in O(2n), which is effectively O(n): you pass over your initial list only once, and then over the result list once (which, if there are duplicates, will be smaller). The second pass can be skipped entirely if you're happy to work with a dictionary instead of a list.
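As a side note, a minimal sketch of the same accumulate-in-a-dictionary idea using collections.Counter (the counts and result names here are just for illustration):

from collections import Counter

counts = Counter()
for item in duplicate_array:
    counts[(item['id'], item['name'])] += item['count']  # accumulate per (id, name) key

result = [{'id': i, 'name': n, 'count': c} for (i, n), c in counts.items()]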

Another way to do this is in two steps, where we first collect all the unique (id, name) pairs, and then accumulate the counts for each:

st = {(item.get('id'), item.get('name')) for item in duplicate_array}
ls = [{'id': id, 'name': name, 'count': sum(item.get('count') for item in duplicate_array if item.get('id') == id and item.get('name') == name)} for id, name in st]

This yields your result:

>>> st = {(item.get('id'), item.get('name')) for item in duplicate_array}
>>> ls = [{'id': id, 'name': name, 'count': sum(item.get('count') for item in duplicate_array if item.get('id') == id and item.get('name') == name)} for id, name in st]
>>> ls
[{'id': 2, 'name': 'peter', 'count': 1}, {'id': 1, 'name': 'john', 'count': 3}]

This is more compact, but a little harder to unpack. The first pass (st = ...) is creating a set of tuples, similar to the first option. The second pass is creating an array of dictionaries, where each dictionary passes over the original array looking for values it should accumulate into count.

I do think this will be slower over very large sets, since each new dictionary created in ls = ... passes over the whole array. In the worst case, where you have no duplicates, this means O(n^2). But if you're looking for compactness, there you go.
