Iterating through a data frame and aggregating sums


Hope you can help me.

I have the following dataframe named df (see the table below).

I have two sales reps - sales_rep_1 and sales_rep_2.

I need to assign one of them to each letter in the table (A to Z).

For each row I would like to assign either sales_rep_1 or sales_rep_2 to a new column, df['Rep'].

The only condition is that the sum of each of the columns DM Bookings, Revenue and AtL Opps needs to be split as evenly as possible between the two sales reps.

Letter  DM Bookings  Revenue  AtL Opps
     A          6.0  42506.0        34
     B          2.0  21055.0        41
     C          1.0   6307.0        36
     D          0.0   8254.0        14
     E          1.0  29878.0        38
     F          0.0   6911.0        10
     G          1.0   6735.0        19
     H          0.0      0.0        80

Is there a way to do this? The result does not need to be perfect or exact.

Any help would be great.

Thanks in advance

CodePudding user response:

This is an optimization problem. So, does scipy have a ready-made solution for this? Close, but not quite - it has linear programming solvers which are very general but not that easy to use.

With thanks to the SO answer "Linear sum assignment (SciPy) and balancing the costs", we can use a linear programming solver (PuLP) to set up our conditions and solve for the best assignment between two workers. Here we just adapt that answer by @joni.
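(If you want to reproduce this end to end, df can be rebuilt from the table in the question; this reconstruction is only for convenience.)

import pandas as pd

df = pd.DataFrame({
    "Letter": list("ABCDEFGH"),
    "DM Bookings": [6.0, 2.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "Revenue": [42506.0, 21055.0, 6307.0, 8254.0, 29878.0, 6911.0, 6735.0, 0.0],
    "AtL Opps": [34, 41, 36, 14, 38, 10, 19, 80],
})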

We'll need to define a per-task cost so that the solution can optimize the distribution of total cost between the reps, or "workers".

import pulp
import numpy as np

# the two "workers" (sales reps); these labels are independent of the Letter column
workers = ["A", "B"]
n_workers = len(workers)
n_tasks = len(df)

# columns to balance, in decreasing order of priority
columns = ["Revenue", "AtL Opps", "DM Bookings"]

Rescale and weight the columns: matrix-multiply by [1, 1/3, 1/9] to give decreasing weight to the columns in the order they were given.

# rescale each column to a 0-1 range
task_weights = df[columns] / df[columns].max(axis=0)

# combine the rescaled columns, weighting them by [1, 1/3, 1/9] to de-prioritize later columns
task_weight = task_weights @ 3.**(-np.arange(len(columns)))

Then define the constrained optimization problem:

# actual cost matrix - one row per worker
# in our case, cost is the same regardless of which rep handles it
c = np.stack([task_weight] * len(workers), axis=0)

# create the model
mdl = pulp.LpProblem("even_assignment")

# decision variables
x = {}
for w in workers:
    for t in range(n_tasks):
        x[w, t] = pulp.LpVariable(f"x[{w}, {t}]", cat="Binary")

max_val = pulp.LpVariable("max_val", cat=pulp.LpContinuous)
min_val = pulp.LpVariable("min_val", cat=pulp.LpContinuous)

# objective: minimize the difference between the maximum and the minimum
#            costs per worker
mdl.setObjective(max_val - min_val)

# constraint: each task can only be assigned once
for task in range(n_tasks):
    mdl.addConstraint(sum(x[w, task] for w in workers) == 1)

# constraint: evenly distribute the tasks
for i_w, w in enumerate(workers):
    assignment_cost = sum(x[w, task] * c[i_w, task] for task in range(n_tasks))
    mdl.addConstraint(assignment_cost <= max_val)
    mdl.addConstraint(assignment_cost >= min_val)

# solve the problem
mdl.solve()
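
# optional: confirm the solver reports an optimal assignment before reading results
# print(pulp.LpStatus[mdl.status])   # expect "Optimal"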

df_result = df.copy()

# Output
for i_w, w in enumerate(workers):
    worker_cost = sum(x[w, t].varValue*c[i_w, t] for t in range(n_tasks))
    print(f"scaled cost for rep {w}: {worker_cost:.2f}")
    worker_task = [x[w, t].varValue for t in range(n_tasks)]
    df_result[w] = worker_task
print()
df_result = (df_result
             .assign(Rep=lambda d: np.select([d[w] == 1 for w in workers], workers, default=None))
             .drop(columns=workers))

df_result
scaled cost for rep A: 2.10
scaled cost for rep B: 2.09

  Letter  DM Bookings  Revenue  AtL Opps    Rep
0     A           6.0  42506.0        34      B
1     B           2.0  21055.0        41      A
2     C           1.0   6307.0        36      A
3     D           0.0   8254.0        14      B
4     E           1.0  29878.0        38      A
5     F           0.0   6911.0        10      A
6     G           1.0   6735.0        19      B
7     H           0.0      0.0        80      B

Evaluation time, how did we split?

pd.concat([df_result.groupby("Rep").size().rename("count"),
           df_result.groupby("Rep").sum(numeric_only=True)], axis=1)
     count  DM Bookings  Revenue  AtL Opps
Rep                                       
A        4          4.0  64151.0       125
B        4          7.0  57495.0       147

Different weights per column will adjust the split differently. For example, using [1, 0.1, 0.01] would have prioritized Revenue more.
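
That swap would look something like this (a sketch reusing task_weights from above; rebuild c and re-solve the model afterwards):

# explicit weight vector instead of the 3.**(-np.arange(...)) scheme
weights = np.array([1, 0.1, 0.01])   # heaviest weight on the first column, Revenue
task_weight = task_weights @ weights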

If balancing the number of items assigned to each rep is also wanted, then add a synthetic column with all ones and give it an appropriate priority and weight.
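
A sketch of that idea, reusing the same weighting scheme (the column name "Ones" is just a placeholder):

# synthetic all-ones column: each assigned row then adds the same small cost,
# which nudges the solver towards giving both reps a similar number of rows
df["Ones"] = 1
columns = ["Revenue", "AtL Opps", "DM Bookings", "Ones"]
task_weights = df[columns] / df[columns].max(axis=0)
task_weight = task_weights @ 3.**(-np.arange(len(columns)))   # weights are now [1, 1/3, 1/9, 1/27]
# ... then rebuild c, the variables and constraints, and re-solve as above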
