The minimum number of rows that sum to a given number in Python

Time:03-14

I have a DataFrame with cash, inflows and outflows. I need to create a feature, survival, that is the maximum number of periods for which the current cash is enough to pay the projected outflows (the inflows are excluded from the computation).

Let's take an example from the table below.

(Again, the inflows do not count in this exercise.)
At t=1, starting from cash = 100, I can absorb the outflows -20, -50, -10, -10 and still have positive cash (100 - 20 - 50 - 10 - 10 = 10 > 0), whereas adding the outflow at t=5 would make the cash negative. Since I can "survive" 4 periods from t=1, survival = 4.
At t=2, survival = 3, and so on.

As it is a big DataFrame, how can I do it efficiently with Pandas?

t  cash  outflow  inflow  survival
1   100      -20      10         4
2    90      -50      10         3
3    50      -10      80         2
4   120      -10      70       ...
5    40      -50      60       ...

CodePudding user response:

I would do it like this:

df['survival'] = [(cash + df.iloc[i:].outflow.cumsum() > 0).sum() for i, cash in enumerate(df.cash)]

Output:

   t  cash  outflow  survival
0  1   100      -20         4
1  2    90      -50         3
2  3    50      -10         2
3  4   120      -10         2
4  5    40      -50         0

Explanation: I loop over the cash values, using enumerate to keep track of the row number. The row number selects only the part of the DataFrame from the current row downwards. On that part I take the cumulative sum of the outflows and add it to the current cash. This yields a series that turns negative (or zero) as soon as the cash no longer covers the accumulated outflows from the current row onwards. Comparing it with > 0 gives True while the balance is positive and False otherwise. Finally I sum the series; each True counts as 1, so the result is the survival number you are looking for. This relies on the outflows never being positive, so that the running balance can only go down.
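To make the steps in the explanation concrete, here is a small sketch that unrolls the one-liner for the first row only (t=1), using the sample data from the question. The intermediate names (remaining, balance, still_positive, survival_t1) are just illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "t": [1, 2, 3, 4, 5],
    "cash": [100, 90, 50, 120, 40],
    "outflow": [-20, -50, -10, -10, -50],
})

i = 0                                      # position of the row being evaluated (t=1)
cash = df.cash.iloc[i]                     # starting cash for that row: 100
remaining = df.iloc[i:].outflow.cumsum()   # accumulated outflows: -20, -70, -80, -90, -140
balance = cash + remaining                 # running balance: 80, 30, 20, 10, -40
still_positive = balance > 0               # True, True, True, True, False
survival_t1 = int(still_positive.sum())    # each True counts as 1, giving 4
```

The list comprehension simply repeats these steps for every value of i.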

CodePudding user response:

With your sample data:

df = pd.DataFrame({
    't': [1, 2, 3, 4, 5],
    'cash': [100, 90, 50, 120, 40],
    'outflow': [-20, -50, -10, -10, -50]
})

I chose to pass the following function to pandas apply(), with x being the row under evaluation and df the complete DataFrame (the lookup by x.name + i assumes the default 0, 1, 2, ... index):

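def survival(x, df):
    cash = x['cash']
    i = 0
    # Keep adding the upcoming (negative) outflows while the balance stays positive
    while cash > 0:
        try:
            cash = cash + df.loc[x.name + i]['outflow']
            i += 1
        except KeyError:
            # x.name + i ran past the last row of the DataFrame
            print('End of dataframe')
            i += 1
            cash = -1  # To make sure we leave the loop
    # The last increment belongs to the period that broke the balance (or to the end of the data)
    return i - 1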

Then apply it to every row:

df['survival'] = df.apply(survival, args=(df,), axis=1)

# Output

   t  cash  outflow  survival
0  1   100      -20         4
1  2    90      -50         3
2  3    50      -10         2
3  4   120      -10         2
4  5    40      -50         0
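Since the question asks about a big DataFrame, note that the row-by-row approaches above re-scan the rest of the frame for every row. A possible alternative, sketched below, swaps in a different technique: one cumulative sum plus a binary search with numpy.searchsorted. It assumes every outflow is zero or negative (so the cumulative spending never decreases) and that the values are integers or otherwise exactly comparable. The helper name survival_searchsorted is hypothetical.

```python
import numpy as np
import pandas as pd

def survival_searchsorted(cash, outflow):
    """Hypothetical helper: periods each row's cash covers, assuming outflow <= 0."""
    spending = -np.asarray(outflow)                  # positive amounts paid per period
    # prefix[k] = total spending over the first k rows, so prefix[0] == 0
    prefix = np.concatenate(([0], np.cumsum(spending)))
    # Row i survives through row j when prefix[j + 1] - prefix[i] < cash[i]
    targets = np.asarray(cash) + prefix[:-1]
    # side="left" counts the prefix entries strictly below each target
    below = np.searchsorted(prefix, targets, side="left")
    # The first i + 1 prefix entries precede row i, so they are not periods survived
    return np.maximum(below - np.arange(1, len(spending) + 1), 0)

df = pd.DataFrame({
    "t": [1, 2, 3, 4, 5],
    "cash": [100, 90, 50, 120, 40],
    "outflow": [-20, -50, -10, -10, -50],
})
df["survival"] = survival_searchsorted(df["cash"], df["outflow"])
```

On the sample data this gives the same survival column as the outputs above (4, 3, 2, 2, 0). If some outflows could be positive, the prefix sums would no longer be sorted and the binary search would not be valid.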

CodePudding user response:

Creating the test dataframe

import pandas as pd
import numpy as np

N = 50
cash = 50 # the initial cash

# I will not type your dataframe
df = pd.DataFrame({'inflow': np.random.randint(1,10, N),
                   'outflow': np.random.randint(1, 20, N)})

Then the solution could be achieved with

# computes the cash at the end of each period (here outflow is stored as a positive number)
ccash = (cash + (df['inflow'] - df['outflow']).cumsum())
survival = (ccash[::-1] >= 0).cumsum()[::-1]