Iterating through a data frame and aggregating sums-CodePudding

Hope you can help me.

I have the following dataframe named df (see table below)

I have two sales reps - sales_rep_1 and sales_rep_2.

I need to assign either of them to a a letter in the table. A to Z

For each row I would like to assign either sales_rep_1 or sales_rep_2 to a new column - df['Rep']

The only condition is that the sum of each column: DM Bookings, Revenue and AtL Opps needs to be split as evenly as possible between each sales rep.

Letter	DM Bookings	Revenue	AtL Opps
A	6.0	42506.0	34
B	2.0	21055.0	41
C	1.0	6307.0	36
D	0.0	8254.0	14
E	1.0	29878.0	38
F	0.0	6911.0	10
G	1.0	6735.0	19
H	0.0	0.0	80

Is there a way to do this? The result does not need to be perfect, nor exact

Any help would be great.

Thanks in advance

CodePudding user response：

This is an optimization problem. So, does scipy have a ready-made solution for this? Close, but not quite - it has linear programming solvers which are very general but not that easy to use.

With thanks to SO answer Linear sum assignment (SciPy) and balancing the costs we can use a linear programming solver to set up our conditions and solve for best worker assignment with two workers. Here we just adapt that answer by @joni.

We'll need to define a task cost so that we can base the solution on an optimized distribution of task cost between reps or "workers".

import pulp
import numpy as np

workers = ["A", "B"]
n_workers = len(workers)
n_tasks = len(df)

columns = ["Revenue", "AtL Opps", "DM Bookings"]

Rescale and weigh columns - matrix multiply by [1, 1/3, 1/9] here to give decreasing weight to the columns in the order they were given.

task_weights = (df[columns] / df[columns].max(axis=0))

## combine column weights by multiplying them by [1, 1/3, 1/9] to de-prioritize later columns

task_weight = task_weights @ 3.**(-np.arange(len(columns)))

Then define the constrained optimization problem

# actual cost matrix - one row per worker
# in our case, cost is the same regardless of which rep handles it
c = np.stack([task_weight] * len(workers), axis=0)

# create the model
mdl = pulp.LpProblem("even_assignment")

# decision variables
x = {}
for w in workers:
    for t in range(n_tasks):
        x[w, t] = pulp.LpVariable(f"x[{w}, {t}]", cat="Binary")

max_val = pulp.LpVariable("max_val", cat="Continous")
min_val = pulp.LpVariable("min_val", cat="Continous")

# objective: minimize the difference between the maximum and the minimum
#            costs per worker
mdl.setObjective(max_val - min_val)

# constraint: each task can only be assigned once
for task in range(n_tasks):
    mdl.addConstraint(sum(x[w, task] for w in workers) == 1)

# constraint: evenly distribute the tasks
for i_w, w in enumerate(workers):
    assignment_cost = sum(x[w, task] * c[i_w, task] for task in range(n_tasks))
    mdl.addConstraint(assignment_cost <= max_val)
    mdl.addConstraint(assignment_cost >= min_val)

# solve the problem
mdl.solve()

df_result = df.copy()

# Output
for i_w, w in enumerate(workers):
    worker_cost = sum(x[w, t].varValue*c[i_w, t] for t in range(n_tasks))
    print(f"scaled cost for rep {w}: {worker_cost:.2f}")
    worker_task = [x[w, t].varValue for t in range(n_tasks)]
    df_result[w] = worker_task
print()
df_result = (df_result.assign(Rep = lambda df: np.select([df[w] == 1 for w in workers], [w.upper() for w in workers], default=None))
             .drop(columns=workers))

df_result

scaled cost for rep A: 2.10
scaled cost for rep B: 2.09

  Letter  DM Bookings  Revenue  AtL Opps    Rep
0     A           6.0  42506.0        34      B
1     B           2.0  21055.0        41      A
2     C           1.0   6307.0        36      A
3     D           0.0   8254.0        14      B
4     E           1.0  29878.0        38      A
5     F           0.0   6911.0        10      A
6     G           1.0   6735.0        19      B
7     H           0.0      0.0        80      B

Evaluation time, how did we split?

pd.concat([df_result.groupby("Rep").size().rename("count"), df_result.groupby("Rep").sum()], axis=1)

     count  DM Bookings  Revenue  AtL Opps
Rep                                       
A        4          4.0  64151.0       125
B        4          7.0  57495.0       147

Different weights per column will adjust the split differently. For example using [1, 0.1, 0.01] would have prioritized Revenue more.

If balancing the number of items assigned to each rep is also wanted, then add a synthetic column with all ones and give it an appropriate priority and weight.