Pythonic way to find duplicate and unique values in a list of dictionaries


I have a list of dictionaries:

[
{'name': 'product a', 'sku': 'p100', 'price': 1.2},
{'name': 'product x', 'sku': 'p120', 'price': 1.3},
{'name': 'product x', 'sku': 'p100', 'price': 2.2},
{'name': 'product a', 'sku': 'p100', 'price': 4.1},
{'name': 'product a', 'sku': 'p20', 'price': 1.3},
{'name': 'product a', 'sku': 'p20', 'price': 2.2}]

And I want to split the entries by sku into duplicates and uniques, returning them as two new lists. (Duplicate in my case means the sku appears 3 or more times, i.e. at least in triplicate.)

This is a working solution:

def find_dict_duplicates(lines, key, times=2):
    duplicate = []
    unique = []
    for line in lines:
        count = 0
        for l in lines:
            if line[key] == l[key]:
                count += 1
        if count > times:
            duplicate.append(line)
        else:
            unique.append(line)
    return duplicate, unique

Results:

duplicates =
[
{'name': 'product a', 'sku': 'p100', 'price': 1.2},
{'name': 'product x', 'sku': 'p100', 'price': 2.2},
{'name': 'product a', 'sku': 'p100', 'price': 4.1}]

unique =
[
{'name': 'product x', 'sku': 'p120', 'price': 1.3},
{'name': 'product a', 'sku': 'p20', 'price': 1.3},
{'name': 'product a', 'sku': 'p20', 'price': 2.2}]    # The criterion is more than 2 occurrences to count as a duplicate, so these are correct

But it is super slow and ugly. It is slow because if the list contains 50,000 products, the inner comparison runs 50,000² = 2,500,000,000 times (about 5 minutes of waiting). And it is ugly because it reads more like C than Python.

Can you suggest a better way?

CodePudding user response:

Lol, I got so into it that I spent about 4 hours trying to get the best speed I could. By the way, here's the solution.

Because there's a lot going on, I recommend first stepping through it in a debugger to visualize what's happening.

from threading import Thread, active_count


def find_dict_duplicates(lines, key, times=2):
    duplicate = [] 
    unique = []
    added_values = {}

    
    # Helper that appends the lines at the given indexes to the target list
    add_to_list = lambda add_into, index_of_items_to_add: [
        add_into.append(lines[ind]) for ind in index_of_items_to_add
    ]
    
    # Map each line's index to its value for `key`
    values = {ind: line[key] for ind, line in enumerate(lines)}
    
    # Walk the index/value pairs built above
    for ind, val in values.items():
        # Group the indexes by value,
        # stored as value: [list of indexes]

        if val not in added_values:
            added_values[val] = [ind]

        # If the value was already added to the dict, just append the new index
        else:
            added_values[val].append(ind)
    
    # Now walk the index lists built by the loop above
    for item in added_values.values():
        if len(item) > times:
            list_to_add_to = duplicate
            
        else:
            list_to_add_to = unique
        
        """
           Starts Doing the Heavy Work In Thread, 
           So It does not stop this Loop, 
           Will Start a New Thread Everytime It Comes To This line, 
           But Does not stop the Previous started Thread
        """
        # Does some checking above and Adds the current Item to the correct dictionary
        Thread(
            target=add_to_list, args=(list_to_add_to, item)
        ).start()
    
    # Busy-wait until only the main thread is left
    while active_count() > 1:
        pass
    print("Threads Ended")
    
    return duplicate, unique
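
For reference, here is a quick usage check, assuming the sample list from the question is stored in a variable called products (a name chosen here for illustration):

products = [
    {'name': 'product a', 'sku': 'p100', 'price': 1.2},
    {'name': 'product x', 'sku': 'p120', 'price': 1.3},
    {'name': 'product x', 'sku': 'p100', 'price': 2.2},
    {'name': 'product a', 'sku': 'p100', 'price': 4.1},
    {'name': 'product a', 'sku': 'p20', 'price': 1.3},
    {'name': 'product a', 'sku': 'p20', 'price': 2.2},
]

duplicates, uniques = find_dict_duplicates(products, 'sku', times=2)
print(duplicates)  # the three 'p100' entries
print(uniques)     # 'p120' once and 'p20' twice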

CodePudding user response:

You can iterate through the list once and maintain another dictionary that stores the count of each value. Then you can use the count dictionary to split the entries into unique and duplicate lists. The complexity is O(n).
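
A minimal sketch of that approach, using collections.Counter as the count dictionary (the function name and the times threshold simply mirror the code in the question):

from collections import Counter

def find_dict_duplicates(lines, key, times=2):
    # One pass to count how often each value of `key` occurs
    counts = Counter(line[key] for line in lines)

    # One more pass to split the lines by how common their key value is
    duplicate = [line for line in lines if counts[line[key]] > times]
    unique = [line for line in lines if counts[line[key]] <= times]
    return duplicate, unique

For 50,000 products this should finish in a fraction of a second instead of minutes, since each line is only looked at a constant number of times.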
