How to make a column the cumulative count of a value in another column, starting from its second occurrence, per group?


I would like a column in a pandas dataframe that

  • counts the number of times 'outcome2' is observed in 'value', in 'datetime' order
  • starting from the second observation of 'outcome2'
  • per 'ID' or df.index

import pandas as pd
from io import StringIO

txt = """
ID,datetime,value
A,12/10/2022 10:00:00,outcome1
A,12/10/2022 11:15:10,outcome2
A,14/10/2022 15:30:30,outcome1
B,11/10/2022 11:30:22,outcome1
B,15/10/2022 22:44:11,outcome2
B,15/10/2022 23:30:22,outcome3
B,15/10/2022 23:31:11,outcome2
"""

df = (pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)
        .assign(id_index=lambda x_df: x_df.groupby('ID', sort=False).ngroup())
        .set_index("id_index")
        .rename_axis(index=None))

df = df.assign(value_test=lambda df: df['value'] == 'outcome2',
               value_cumsum=lambda df: df.groupby('ID', sort=False)['value_test'].cumsum())

  ID            datetime     value value_test   value_cumsum
0  A 2022-10-12 10:00:00  outcome1      False              0
0  A 2022-10-12 11:15:10  outcome2       True              1
0  A 2022-10-14 15:30:30  outcome1      False              1
1  B 2022-10-11 11:30:22  outcome1      False              0
1  B 2022-10-15 22:44:11  outcome2       True              1
1  B 2022-10-15 23:30:22  outcome3      False              1
1  B 2022-10-15 23:31:11  outcome2       True              2

I tried assigning a third column to df using if-statements in the lambda functions. It failed with an error others have run into:

df = df.assign(value_test=lambda df: df['value'] == 'outcome2',
               value_cumsum=lambda df: df.groupby('ID', sort=False)['value_test'].cumsum(),
               outcome2=lambda df: 0 if df[df['value_cumsum'] == 1] or df[df['value_cumsum'] == 0]
                                   else df['value_cumsum'] - 1)

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
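
This error is raised because a plain Python `if`/`or` asks for a single truth value of a whole DataFrame, which pandas refuses. A minimal sketch of the same branching done element-wise with `numpy.where`, reusing the `value_cumsum` column built above (just one possible fix, not necessarily the cleanest):

import numpy as np

# Element-wise choice instead of a Python `if` over a whole DataFrame:
# running counts of 0 or 1 map to 0, later 'outcome2' rows map to count - 1.
# Note: non-'outcome2' rows also inherit the running count here.
df['outcome2'] = np.where(df['value_cumsum'] <= 1, 0, df['value_cumsum'] - 1)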

I need only the accumulated sum (running total) of counts of 'outcome2' in 'value', starting from the second observation of 'outcome2', per group.

Any suggestions, please?

And is it possible to do this with a lambda, or without the intermediate value_test and value_cumsum columns?

desired df

  ID            datetime     value outcome2
0  A 2022-10-12 10:00:00  outcome1        0
0  A 2022-10-12 11:15:10  outcome2        0
0  A 2022-10-14 15:30:30  outcome1        0
1  B 2022-10-11 11:30:22  outcome1        0
1  B 2022-10-15 22:44:11  outcome2        0
1  B 2022-10-15 23:30:22  outcome3        0
1  B 2022-10-15 23:31:11  outcome2        1

CodePudding user response:

You can use:

df['value_cumsum'] = (df.groupby('ID')['value_test']
                      .cumsum().sub(1).where(df['value_test'], 0)
                      )

Or, if you also want the False rows to carry the running count instead of 0:

df['value_cumsum'] = (df.groupby('ID')['value_test']
                      .cumsum().sub(1).clip(lower=0)
                      )

output:

  ID            datetime     value  value_test  value_cumsum
0  A 2022-10-12 10:00:00  outcome1       False             0
0  A 2022-10-12 11:15:10  outcome2        True             0
0  A 2022-10-14 15:30:30  outcome1       False             0
1  B 2022-10-11 11:30:22  outcome1       False             0
1  B 2022-10-15 22:44:11  outcome2        True             0
1  B 2022-10-15 23:30:22  outcome3       False             0
1  B 2022-10-15 23:31:11  outcome2        True             1
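
Both variants agree on this sample because no non-'outcome2' row comes after a second 'outcome2' within a group. A small, hedged check with one made-up extra row for group B, where the two would differ:

# Hypothetical extra row: an outcome1 for B *after* its second outcome2.
extra = pd.DataFrame({'ID': ['B'],
                      'datetime': [pd.Timestamp('2022-10-16 08:00:00')],
                      'value': ['outcome1']})
tmp = pd.concat([df[['ID', 'datetime', 'value']], extra], ignore_index=True)
tmp['value_test'] = tmp['value'].eq('outcome2')
cum = tmp.groupby('ID')['value_test'].cumsum().sub(1)

tmp['where_variant'] = cum.where(tmp['value_test'], 0)  # extra row -> 0
tmp['clip_variant'] = cum.clip(lower=0)                 # extra row -> 1 (carries the count)
print(tmp.tail(1))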

Without the intermediate value_test column:

df['value_cumsum'] = (df['value'].eq('outcome2')
                      .groupby(df['ID'])
                      .cumsum().sub(1).clip(lower=0)
                     )

output:

  ID            datetime     value  value_cumsum
0  A 2022-10-12 10:00:00  outcome1             0
0  A 2022-10-12 11:15:10  outcome2             0
0  A 2022-10-14 15:30:30  outcome1             0
1  B 2022-10-11 11:30:22  outcome1             0
1  B 2022-10-15 22:44:11  outcome2             0
1  B 2022-10-15 23:30:22  outcome3             0
1  B 2022-10-15 23:31:11  outcome2             1
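
If you want to stay inside a single assign call, as the question asks, one possible sketch along the same lines (same logic as above, just chained; the column name outcome2 matches the desired output):

out = df.assign(
    outcome2=lambda d: (d['value'].eq('outcome2')
                          .groupby(d['ID'])
                          .cumsum()
                          .sub(1)
                          .clip(lower=0))
)[['ID', 'datetime', 'value', 'outcome2']]
print(out)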