I have a list of dictionaries:
[
{'name': 'product a', 'sku': 'p100', 'price': 1.2},
{'name': 'product x', 'sku': 'p120', 'price': 1.3},
{'name': 'product x', 'sku': 'p100', 'price': 2.2},
{'name': 'product a', 'sku': 'p100', 'price': 4.1},
{'name': 'product a', 'sku': 'p20', 'price': 1.3},
{'name': 'product a', 'sku': 'p20', 'price': 2.2}]
I want to find the duplicate and unique values of sku and return them as two new lists. (In my case, "duplicate" means occurring 3 or more times, i.e. triplicate.)
This is a working solution:
def find_dict_duplicates(lines, key, times=2):
    duplicate = []
    unique = []
    for line in lines:
        # Count how many entries share this line's value for `key`
        count = 0
        for l in lines:
            if line[key] == l[key]:
                count += 1
        if count > times:
            duplicate.append(line)
        else:
            unique.append(line)
    return duplicate, unique
Results:
duplicates =
[
{'name': 'product a', 'sku': 'p100', 'price': 1.2},
{'name': 'product x', 'sku': 'p100', 'price': 2.2},
{'name': 'product a', 'sku': 'p100', 'price': 4.1}]
unique =
[
{'name': 'product x', 'sku': 'p120', 'price': 1.3},
{'name': 'product a', 'sku': 'p20', 'price': 1.3},
{'name': 'product a', 'sku': 'p20', 'price': 2.2}] # The criterion is more than 2 occurrences to be considered a duplicate, so this is correct
But it is super slow and ugly. It is slow because if the list contains 50,000 products, the number of comparisons needed is 50,000^2 = 2,500,000,000 (a 5-minute wait). And it is ugly because it reads more like C than Python.
Can you suggest a better way?
CodePudding user response:
Lol, I got so into it that I spent about 4 hours trying to get the best speed I could. Anyway, here's the solution.
Because there's a lot going on, I recommend first trying to figure out what's happening by stepping through it in a debugger.
from threading import Thread

def find_dict_duplicates(lines, key, times=2):
    duplicate = []
    unique = []
    added_values = {}
    threads = []

    def add_to_list(add_into, index_of_items_to_add):
        for ind in index_of_items_to_add:
            add_into.append(lines[ind])

    # Loops through all the lines and saves them with their index and value
    values = {ind: line[key] for ind, line in enumerate(lines)}
    # Loops through the indexes and values created above
    for ind, val in values.items():
        # Stores the current index in added_values as value: [indexes]
        if val not in added_values:
            added_values[val] = [ind]
        # If the value was already added, just append the new index to it
        else:
            added_values[val].append(ind)
    # Now loops through all the index lists created by the loop above
    for item in added_values.values():
        # Picks the list this group of indexes belongs to
        if len(item) > times:
            list_to_add_to = duplicate
        else:
            list_to_add_to = unique
        # Does the heavy work in a thread so it does not block this loop.
        # A new thread starts on every iteration without stopping
        # the previously started ones.
        thread = Thread(target=add_to_list, args=(list_to_add_to, item))
        thread.start()
        threads.append(thread)
    # Waits for all of the threads to finish
    for thread in threads:
        thread.join()
    print("Threads Ended")
    return duplicate, unique
CodePudding user response:
You can iterate through the list once and maintain another dictionary that stores the count of each value. Then you can use the count dictionary to separate the unique and duplicate entries. The complexity is O(n).
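Something like this is what I mean. It is only a minimal sketch: it assumes every dictionary contains the key and that the values are hashable, and it uses collections.Counter as the count dictionary (a plain dict would work too). The function name and the times parameter are kept from the question; the products variable is just the sample data.

```python
from collections import Counter


def find_dict_duplicates(lines, key, times=2):
    """Split lines by how often their value for `key` occurs."""
    # First pass: count the occurrences of each value.
    counts = Counter(line[key] for line in lines)
    duplicate, unique = [], []
    # Second pass: route each dictionary using the precomputed count.
    for line in lines:
        if counts[line[key]] > times:
            duplicate.append(line)
        else:
            unique.append(line)
    return duplicate, unique


products = [
    {'name': 'product a', 'sku': 'p100', 'price': 1.2},
    {'name': 'product x', 'sku': 'p120', 'price': 1.3},
    {'name': 'product x', 'sku': 'p100', 'price': 2.2},
    {'name': 'product a', 'sku': 'p100', 'price': 4.1},
    {'name': 'product a', 'sku': 'p20', 'price': 1.3},
    {'name': 'product a', 'sku': 'p20', 'price': 2.2},
]

duplicate, unique = find_dict_duplicates(products, 'sku')
```

Both passes are linear, so for 50,000 products this does on the order of 100,000 dictionary operations instead of 2,500,000,000 comparisons, and the original order of the entries is preserved in both lists.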