Python Split Dictionary into Smaller Dictionaries by Threshold Value-CodePudding

Given the following dictionary:

dict_sports = {'Skateboarding': 163,
 'Bowling': 134,
 'Rugby': 83,
 'Wrestling': 57,
 'Baseball': 30,
 'Badminton': 24,
 'Volleyball': 22,
 'Basketball': 18,
 'Football': 16,
 'Weightlifting': 15,
 'Golf': 14,
 'Skiing': 10,
 'Boxing': 8,
 'Cricket': 6}

total_sum = sum([v for k,v in dict_sports.items()])

print(total_sum)

I want to split the dictionary into 3 smaller dictionaries; so that the following condition is met:

The three dictionaries' total sum of their respective values; should be near a certain threshold; say within a percentage plus/minus %2.5 of 200.

i.e. Total sum of each dictionary's values should be:

195 <= total_sum <= 205

Here is my solution, which solves my original problem, but still contradicts the condition I put above, which could be because of the data itself (cannot be split with that accuracy)... My problem is not with the accuracy of the split; my problem is with the amount of coding I have to do for this simple task!

list_01 = []
list_02 = []
list_03 = []

dict_01 = {}
dict_02 = {}
dict_03 = {}

for k,v in dict_sports.items():
    if  (sum(list_01) < 195) and (k not in dict_02) and (k not in dict_03):
        list_01.append(v)
        dict_01[k] = v
        if sum(list_01) > 205:
            list_01.pop()
            dict_01.pop(k, None)
            
print(sum(list_01))            

for k,v in dict_sports.items():
    if  (sum(list_02) < 195) and (k not in dict_01) and (k not in dict_03):
        list_02.append(v)
        dict_02[k] = v
        if sum(list_02) > 205:
            list_02.pop()
            dict_02.pop(k, None)
            
print(sum(list_02))            
            
for k,v in dict_sports.items():
    if  (sum(list_03) < 195) and (k not in dict_01) and (k not in dict_02):
        list_03.append(v)
        dict_03[k] = v
        if sum(list_03) > 205:
            list_03.pop()
            dict_03.pop(k, None)

print(sum(list_03))

print(dict_01)
print(dict_02)
print(dict_03)

I would imagine that there is a much better way to approach this problem, maybe with pandas, or any other python library or a third-party library; or a better code than this!

CodePudding user response：

First of all, what you are describing here is very close to (or is ?) the Multiple knapsack problem. There are a numerous way to solve this problem, depending on what conditions you impose on the results.

I recommended you read this page and the resources associated with it to better frame your desired output. For example, can we omit items ? What if we cannot satisfy your constraint on the results (within [195,205]) with the current list of items ?

Using pure python, a naive approach to reduce code amount using a greedy approach could be (given that your dictionary is sorted in descending order):

# Initialize a list of dictionaries to fill
dicts = [{},{},{}]
# Define a counter
i = 0
# Define your maximum value
max_value = 205


for k,v in dict_sports.items():
    for i in range(len(dicts)):
        if sum(dicts[i].values())   v < max_value:
            dicts[i].update({k:v})
            break
    i =1

for d in dicts:
    print("---- {} ----".format(sum(d.values())))
    print(d)

Giving

---- 203 ----
{'Skateboarding': 163, 'Baseball': 30, 'Skiing': 10}
---- 199 ----
{'Bowling': 134, 'Wrestling': 57, 'Boxing': 8}
---- 198 ----
{'Rugby': 83, 'Badminton': 24, 'Volleyball': 22, 'Basketball': 18, 'Football': 16, 'Weightlifting': 15, 'Golf': 14, 'Cricket': 6}

Note that this solution may skip items if the sum condition is not met for all dictionnary.

CodePudding user response：

When looking closely, the bodies of the three for loops differ only in their variable names. Otherwise they implement the same algorithm. This duplication of code can be eliminated by abstracting the algorithm into a function instead of repeating it three times.

Rather than just showing the end result I think it will be more useful to outline a possible process to get there.

Let's start by simply wrapping the duplicated code in a function:

list_01 = []
list_02 = []
list_03 = []

dict_01 = {}
dict_02 = {}
dict_03 = {}

def split_dict():
    for k, v in dict_sports.items():
        if  (sum(list_01) < 195) and (k not in dict_02) and (k not in dict_03):
            list_01.append(v)
            dict_01[k] = v
            if sum(list_01) > 205:
                list_01.pop()
                dict_01.pop(k, None)

split_dict()
            
print(sum(list_01))    
#...

Now we don't want our function to modify global variables but return computed results:

def split_dict():
    list_01 = []
    dict_01 = {}
    for k, v in dict_sports.items():
        if  (sum(list_01) < 195) and (k not in dict_02) and (k not in dict_03):
            list_01.append(v)
            dict_01[k] = v
            if sum(list_01) > 205:
                list_01.pop()
                dict_01.pop(k, None)
    return list_01, dict_01

list_01, dict_01 = split_dict()

We need to change dict_02 and dict_03 in the if condition if we want to reuse the function, so let's turn them into parameters:

def split_dict(dict_02, dict_03):
    # ... unchanged code omitted for brevity
    return list_01, dict_01

list_01, dict_01 = split_dict(dict_02, dict_03)

It's bad practice to have the same variable names locally and globally, also the function should work in different situations. So let's clean up and use more generic and descriptive names:

def split_dict(exclude_01, exclude_02):
    selected_values = []
    selected_items = {}
    for k, v in dict_sports.items():
        if  (sum(selected_values) < 195) and (k not in exclude_01) and (k not in exclude_02):
            selected_values.append(v)
            selected_items[k] = v
            if sum(selected_values) > 205:
                selected_values.pop()
                selected_items.pop(k, None)
    return selected_values, selected_items

list_01, dict_01 = split_dict(dict_02, dict_03)

Finally, we can use the function in all three places simply by using different input and output variables:

list_01, dict_01 = split_dict({}, {})            
print(sum(list_01))            
            
list_02, dict_02 = split_dict(dict_01, {})            
print(sum(list_02))            

list_03, dict_03 = split_dict(dict_01, dict_02)    
print(sum(list_03))

At this point, the amount of code is reduced by a factor close to 3. It also has the advantage that if you want to further tweak the algorithm there is only one place to change.

A few suggestions for next steps to take:

Try to combine exclude_01 and exclude_02 into a single function parameter
Change the hard coded numeric constants into function parameters
Avoid subsequent append/pop by checking the sum first
Come up with a new algorithm that satisfies the constraints even on later splits
...