how to label a series of sums until they turn positive again?-CodePudding

what is the most efficient approach to this problem? I have a df with users (users are not unique), a month column and a profit/loss column. I want to create a new binary column 'eligibility' where it is always 1 unless the following scenario: if a user shows a loss, they need to cover that amount in the next month(s) in order to be eligible, so for the first negative month they are 0 until they cover the sum

for instance using a dataset like this:

user,month,profit/loss
1,2022-03-01,400
2,2022-03-01,-100
3,2022-03-01,5000
1,2022-04-01,-10
2,2022-04-01,80
3,2022-04-01,400
1,2022-05-01,60
2,2022-05-01,60

new column would be

CodePudding user response：

Maybe you could loop through the dataset, and when a loss is found, map that loss to the corresponding user in any desired manner (for example a dictionary or a hashmap) and add a 0 to the column.

When a profit is found, check if that user is currently at a loss, if so, add the profit to the loss, and if the result of the addition is greater than 0, the record can be deleted and a 1 can be added to the column. If the user with the profit is not found to be currently a loss, just move on and add a 1 to the column.

I hope this helps.

CodePudding user response：

The logic is not fully clear, so assuming you want 1 when a user has an overall positive (or null) profit, and 0 if negative, you can use groupby.cumsum and compare to 0:

df['out'] = df.groupby('user')['profit/loss'].cumsum().gt(0).astype(int)

output:

   user       month  profit/loss  out
0     1  2022-03-01          400    1
1     2  2022-03-01         -100    0
2     3  2022-03-01         5000    1
3     1  2022-04-01          -10    1  # (*) see note below
4     2  2022-04-01           80    0
5     3  2022-04-01          400    1
6     1  2022-05-01           60    1
7     2  2022-05-01           60    1

# (*) this value is different from yours, but the overall gain is 390 (>0)

CodePudding user response：

import pandas as pd
df = pd.DataFrame([
    [1,'2022-03-01',400],
    [2,'2022-03-01',-100],
    [3,'2022-03-01',5000],
    [1,'2022-04-01',-10],
    [2,'2022-04-01',80],
    [3,'2022-04-01',400],
    [1,'2022-05-01',60],
    [2,'2022-05-01',60]
], columns=['user','month',])

data = (
     pd
    .concat([df.user, df.groupby('user')
    .agg({'profit/loss':'cumsum'})
    .rename(columns={'profit/loss':"cumsum"})], axis=1)
)

data["cummax"] = (
   data
    .groupby('user')
    .agg({'cumsum':'cummax'})
)
data.apply(lambda x: 0 if x['cumsum'] < 0 else 0 if x["cumsum"] < x["cummax"] else 1, axis=1)

output:

0    1
1    0
2    1
3    0
4    0
5    1
6    1
7    1
dtype: int64