Home > database >  How to groupby and get cumcount using a criteria?
How to groupby and get cumcount using a criteria?

Time:02-27

I have a dataframe like as shown below

df = pd.DataFrame(
    {'supplier_id':[1,1,1,1],
     'prod_id':[123,456,789,342],
     'country' : ['UK', 'UK', 'UK','US'],
     'transaction_date' : ['13/11/2020', '10/1/2018','11/11/2017', '27/03/2016'],
     'industry' : ['STA','STA','PSA','STA'],
     'segment' : ['testa','testb','testa','testc'],
     'label':[1,1,1,0]})

My objective is to find out answers to the below questions

a) from the current row, How many times prior (previously), the same supplier has succeeded and failed in the same country? (use supplier_id and country column). here column label = 1 means success and label=0 means failure

Similarly, I would like to compute the success and failure count based on industry, country and segment as well.

Note that 1st transaction will always starts with 0 because supplier will have no previous transactions associated with that column.

As we are looking at chronological order of business done, we need to first sort the dataframe based on transaction_date.

So, I tried the below

df.sort_values(by=['supplier_id','transaction_date'],inplace=True)
df['prev_biz_country_success_count'] = df.groupby(['supplier_id', 'country']).cumcount()
df['prev_biz_country_failure_count'] = df.groupby(['supplier_id', 'country']).cumcount()

but as you can see, am not sure how to include the label column value. Meaning, we need to count based on label=1 and label=0.

I expect my output to be like as shown below

enter image description here

CodePudding user response:

We can group the dataframe by supplier_id and country column then apply transformation function shift cumsum on label column to get the count of rows where the criteria in previous rows

g = df.groupby(['supplier_id', 'country'])

for criteria, label in dict(success=1, failure=0).items():
    df[f'prev_biz_country_{criteria}_count'] =\
        g['label'].apply(lambda s: s.eq(label).shift(fill_value=0).cumsum())

   supplier_id  prod_id country transaction_date industry segment  label prev_biz_country_success_count prev_biz_country_failure_count
1            1      456      UK        10/1/2018      STA   testb      1                              0                              0
2            1      789      UK       11/11/2017      PSA   testa      1                              1                              0
0            1      123      UK       13/11/2020      STA   testa      1                              2                              0
3            1      342      US       27/03/2016      STA   testc      0                              0                              0
  • Related