I would like a column in a pandas
dataframe
that
- counts the number of times
'outcome2'
is observed in'value'
through'datetime'
- starting from the second observation of
'outcome2'
- per
'ID'
ordf.index
import pandas as pd
from io import StringIO
import datetime
txt= """
ID,datetime,value
A,12/10/2022 10:00:00,outcome1
A,12/10/2022 11:15:10,outcome2
A,14/10/2022 15:30:30,outcome1
B,11/10/2022 11:30:22,outcome1
B,15/10/2022 22:44:11,outcome2
B,15/10/2022 23:30:22,outcome3
B,15/10/2022 23:31:11,outcome2
"""
df = pd.read_csv(StringIO(txt),\
parse_dates=[1],\
dayfirst=True)\
.assign(id_index= lambda x_df: x_df\
.groupby('ID', sort=False).ngroup())\
.set_index("id_index")\
.rename_axis(index=None)
df = df.assign(value_test = lambda df: df['value']=='outcome2',\
value_cumsum= lambda df: df.groupby('ID', sort=False)['value_test'].cumsum())
ID datetime value value_test value_cumsum
0 A 2022-10-12 10:00:00 outcome1 False 0
0 A 2022-10-12 11:15:10 outcome2 True 1
0 A 2022-10-14 15:30:30 outcome1 False 1
1 B 2022-10-11 11:30:22 outcome1 False 0
1 B 2022-10-15 22:44:11 outcome2 True 1
1 B 2022-10-15 23:30:22 outcome3 False 1
1 B 2022-10-15 23:31:11 outcome2 True 2
I tried assigning a third variable to df
using if-statements in the lambda
functions. It failed in a way others have experienced 1:
df = df.assign(value_test = lambda df: df['value']=='outcome2',\
value_cumsum = lambda df: df.groupby('ID', sort=False)['value_test'].cumsum(),\
outcome2 = lambda df: 0 if df[df[value_cumsum]==1] or df[df[value_cumsum]==0]\
else df[value_cumsum]-1 if df[df[value_cumsum] > 1]
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I need only the accumulated sum (running total) of counts of 'outcome2'
in 'value'
starting from the second observation of 'outcome2'
per group.*
Any suggestions, please?
And is it possible to use lambda
or without the intermediate step making value_test
or value_cumsum
?
desired df
ID datetime value outcome2
0 A 2022-10-12 10:00:00 outcome1 0
0 A 2022-10-12 11:15:10 outcome2 0
0 A 2022-10-14 15:30:30 outcome1 0
1 B 2022-10-11 11:30:22 outcome1 0
1 B 2022-10-15 22:44:11 outcome2 0
1 B 2022-10-15 23:30:22 outcome3 0
1 B 2022-10-15 23:31:11 outcome2 1
CodePudding user response:
You can use:
df['value_cumsum'] = (df.groupby('ID')['value_test']
.cumsum().sub(1).where(df['value_test'], 0)
)
Or, if you also want to label the False:
df['value_cumsum'] = (df.groupby('ID')['value_test']
.cumsum().sub(1).clip(lower=0)
)
output:
ID datetime value value_test value_cumsum
0 A 2022-10-12 10:00:00 outcome1 False 0
0 A 2022-10-12 11:15:10 outcome2 True 0
0 A 2022-10-14 15:30:30 outcome1 False 0
1 B 2022-10-11 11:30:22 outcome1 False 0
1 B 2022-10-15 22:44:11 outcome2 True 0
1 B 2022-10-15 23:30:22 outcome3 False 0
1 B 2022-10-15 23:31:11 outcome2 True 1
without intermediate:
df['value_cumsum'] = (df['value'].eq('outcome2')
.groupby(df['ID'])
.cumsum().sub(1).clip(lower=0)
)
output:
ID datetime value value_cumsum
0 A 2022-10-12 10:00:00 outcome1 0
0 A 2022-10-12 11:15:10 outcome2 0
0 A 2022-10-14 15:30:30 outcome1 0
1 B 2022-10-11 11:30:22 outcome1 0
1 B 2022-10-15 22:44:11 outcome2 0
1 B 2022-10-15 23:30:22 outcome3 0
1 B 2022-10-15 23:31:11 outcome2 1