Given a dataframe, I need to increment the event_id when a discontinuity observed in the column. Here for the given data below, if the difference between the current data and the previous data is >5 then the succeeding column has to be mark with next event_id.
id, data, event_id,
aa, 2, 1,
aa, 4, 1,
aa, 6, 1,
aa, 12, 2,
aa, 14, 2,
aa, 15, 2,
I tried with below code,
df['pre_data']=df.groupby('id')['data'].shift(1)
df['diff_flag']=np.where((df['data']-df['pre_data'])<5,1,0)
df['event_id']=df['diff_flag'].ne(df.groupby('id')['diff_flag'].shift()).cumsum()
But the code is giving the output of event_id as (1 ,1,1,2,3,3) while the expected output is (1 ,1,1,2,2,2)
CodePudding user response:
if i understand well your problem you need to increase the event_id
every time the difference is more than 5.
In this case the solution is in your code you just need to change this
df['diff_flag']=np.where((df['data']-df['pre_data'])<5,1,0)
to this:
df['diff_flag']=np.where((df['data']-df['pre_data'])>5,1,0)
and return:
df["diff_flag"].cumsum()
because the numpy where
is the oposite of pandas where
in numpy we update the value that match the condition