I have this very simple idea of finding the subset of rows that minimizes the sum of a column, while the sum of another column has to be greater than a certain value.
Example:
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
Names Target Cost
0 a 35 15
1 b 15 40
2 c 12 30
3 d 8 30
4 e 7 25
5 f 5 10
In this example above, I would like to find the subset of rows that minimizes the Cost column, while the sum of Target has to be greater than 40.
In this example, the function I'm looking to build would return ['a', 'f'], because the constraint 35 + 5 >= 40 is satisfied and the cost 15 + 10 = 25 cannot be improved by any other combination of rows that satisfies the constraint.
What libraries or ideas am I looking for to resolve this problem?
CodePudding user response:
We can set this up as a constraint optimization problem, which has three modelling parts plus a solve step:
- Create the variables: we represent the row selection with a vector of boolean variables. True in the k-th entry means row k is selected, False means it is not.
- Specify the constraints: the sum of Target over the selected rows must be at least the threshold, expressed as the dot product of the Target column with the selection vector.
- Specify the objective function: here it is the total Cost of the selected rows, i.e. the dot product of the Cost column with the selection vector.
- Solve: minimize the objective function subject to the constraints.
There are several Python operations research (OR) libraries for solving this type of problem. This solution uses Google OR-Tools, an "open-source software suite for optimization".
We show below that this is much faster than solutions that pick the best rows by performing an exhaustive search over all possible row selections.
Code
import numpy as np
import pandas as pd
# Google or-tools solver
from ortools.sat.python import cp_model
import timeit
def solve(df, threshold):
    '''
    Uses the OR-Tools CP-SAT solver to find the minimum-cost row selection
    whose Target sum is at least `threshold`.
    '''
    weights = df['Target']
    cost = df['Cost']
    names = df['Names']

    # Creates the model.
    model = cp_model.CpModel()

    # Step 1: Create the variables
    # Array of row-selection flags, i.e. True if row k is selected, False otherwise
    # Note: booleans are treated as 1/0 in arithmetic expressions
    row_selection = [model.NewBoolVar(f'{i}') for i in range(df.shape[0])]

    # Step 2: Define the constraints
    # The sum of the weights of the selected rows should be >= threshold
    model.Add(weights.dot(row_selection) >= threshold)

    # Step 3: Define the objective function
    # Minimize the total cost of the selected rows
    model.Minimize(cost.dot(row_selection))

    # Step 4: Create the solver and solve.
    solver = cp_model.CpSolver()
    solver.Solve(model)

    # Get the selected rows
    rows = [row for row in range(df.shape[0]) if solver.Value(row_selection[row])]
    return names[rows]
# Setup
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
print(solve(df, 40))
# Output:
0 a
5 f
Name: Names, dtype: object
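If the minimal total cost itself is also needed (not just the row names), the CP-SAT solver exposes it via solver.ObjectiveValue(). A minimal sketch of how the end of solve() could be changed; returning a tuple is an assumption, not part of the original function:
    # Return the selected names together with the optimal objective value (the total cost)
    return names[rows], solver.ObjectiveValue()   # objective value here: 25.0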
Performance
Current solution (based upon OR-Tools)
%timeit solve(df, 40)
3.13 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compare this to exhaustive search algorithms such as Scott Boston's powerset solution:
from itertools import combinations, chain
df = pd.DataFrame(
{
"Names": ["a", "b", "c", "d", "e", "f"],
"Target": [35, 15, 12, 8, 7, 5],
"Cost": [15, 40, 30, 30, 25, 10],
}
)
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
%timeit min( (df.loc[list(i)] for i in powerset(df.index) if df.loc[list(i), "Target"].sum() >= 40), key=lambda x: x["Cost"].sum(),)
64.4 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Thus, OR-Tools is roughly 20x faster than the exhaustive search (3.13 ms vs. 64.4 ms).
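Since the question also asks which libraries to look at: the model above is a small 0/1 knapsack-style mixed-integer program, so a generic MILP solver works too. The following is only a sketch, assuming SciPy >= 1.9 is available; solve_milp is a made-up helper name, not part of OR-Tools or the answer above.
import numpy as np
import pandas as pd
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_milp(df, threshold):
    # Minimize cost @ x subject to target @ x >= threshold, with x binary
    target = df['Target'].to_numpy(dtype=float)
    cost = df['Cost'].to_numpy(dtype=float)
    res = milp(
        c=cost,                                                # objective: total cost
        constraints=LinearConstraint(target.reshape(1, -1), lb=threshold),
        integrality=np.ones(len(df)),                          # integer variables ...
        bounds=Bounds(0, 1),                                   # ... restricted to {0, 1}
    )
    selected = res.x.round().astype(bool)
    return df.loc[selected, 'Names']

print(solve_milp(df, 40))   # expected to pick rows a and f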
CodePudding user response:
Start with module imports, dummy data, and desired threshold of 40.
import pandas as pd
import itertools
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
max_value = 40 # greater than or equal to
Iterate over all target values, including row index.
numbers = [(index, target) for index, names, target, cost in df.itertuples()]
#print(numbers)
Select all combinations whose Target sum is at least the threshold AND whose sum minus the smallest Target value in the combination falls below the threshold (i.e. dropping the smallest row would break the constraint, so no redundant rows are included).
# reference: https://stackoverflow.com/questions/34517540/find-all-combinations-of-a-list-of-numbers-with-a-given-sum
combinations = [(x[0] for x in seq) for i in range(len(numbers), 0, -1) \
for seq in itertools.combinations(numbers, i) \
if sum([x[1] for x in seq]) >= max_value and sum([x[1] for x in seq]) - min([x[1] for x in seq]) < max_value]
combinations = list(set(tuple(x) for x in combinations))
#print(combinations)
Calculate combination with minimum cost and return relevant values.
min_cost, min_index, min_name = None, None, None
for combination in combinations:
    sum_cost = sum([df["Cost"].loc[x] for x in combination])
    if min_cost is None or sum_cost < min_cost:
        min_cost, min_index, min_name = sum_cost, combination, [df["Names"].loc[x] for x in combination]
print("minimum cost:", min_cost)
#print("Index combination:", min_index)  # if Names not unique
print("Names combination:", min_name)
Output
minimum cost: 25
Names combination: ['a', 'f']
CodePudding user response:
Let's use the itertools recipe powerset:
import pandas as pd
import numpy as np
from itertools import combinations, chain
df = pd.DataFrame(
{
"Names": ["a", "b", "c", "d", "e", "f"],
"Target": [35, 15, 12, 8, 7, 5],
"Cost": [15, 40, 30, 30, 25, 10],
}
)
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
df_out = min(
    (
        df.loc[list(i)]
        for i in powerset(df.index)
        if df.loc[list(i), "Target"].sum() >= 40
    ),
    key=lambda x: x["Cost"].sum(),
)
df_out
Output:
Names Target Cost
0 a 35 15
5 f 5 10
Details:
Use the itertools recipe "powerset" to generate all combinations of the index, from the empty set up to the full length of the dataframe. Then, use a generator expression to keep only those combinations whose Target sums to 40 or greater. Lastly, use Python's min function with the key parameter to return the dataframe/combination with the lowest cost.
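The powerset search can be sped up without changing the idea by pre-extracting the Target and Cost columns as plain lists, so each candidate subset avoids repeated df.loc calls. A sketch only (best_subset is a made-up name), reusing df and powerset from above and assuming the default RangeIndex so index labels match positions:
target = df["Target"].tolist()
cost = df["Cost"].tolist()

def best_subset(threshold=40):
    # Keep only index subsets whose Target sum meets the threshold ...
    feasible = (
        idx for idx in powerset(df.index)
        if sum(target[i] for i in idx) >= threshold
    )
    # ... and pick the one with the smallest total Cost
    best = min(feasible, key=lambda idx: sum(cost[i] for i in idx))
    return df.loc[list(best)]

print(best_subset())   # same two rows: a and f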
CodePudding user response:
This is a little bit of a brute force way to solve the problem, but it will work:
import pandas as pd
from itertools import combinations
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
lst=[]
for x in range(2, len(df.Target) + 1):
    lst += [y for y in list(combinations(list(zip(list(df.Target), list(df.Cost))), x))]
lst1 = []
count = -1
for x in lst:
    count += 1
    v0 = 0
    v1 = 0
    for y in x:
        v0 += y[0]
        v1 += y[1]
    lst1.append((count, v0, v1))
df01=pd.DataFrame(lst1, columns=["ID", "target_sum", "cost_sum"])
ans=lst[df01[df01.target_sum>=40].cost_sum.idxmin()]
Answer=[df[(df.Target==ans[x][0]) & (df.Cost==ans[x][1])].Names.values[0] for x in range(0, len(ans))]
Output:
['a', 'f']