I have a dataset that looks like this:
id event time
1 open 2022-07-05
1 close 2021-05-05
2 open 2022-05-05
3 open 2019-07-12
1 close 2022-06-05
3 open 2018-07-12
3 close 2018-08-12
2 close 2023-05-05
I want to find the first occurrence of each event for each id. It is important that close comes after open. Expected output:
id event time
1 open 2022-07-05
1 close 2021-05-05
2 open 2022-05-05
3 open 2018-07-12
3 close 2018-08-12
2 close 2023-05-05
CodePudding user response:
Update
It is important that close goes after open
I slightly modified your dataframe:
id event time
0 1 open 2022-07-05
1 1 close 2021-05-05
2 2 close 2022-04-04 # close event occurs before open event
3 2 open 2022-05-05
4 3 open 2019-07-12
5 1 close 2022-06-05
6 3 open 2018-07-12
7 3 close 2018-08-12
8 2 close 2023-05-05
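To reproduce this, the frame above can be built roughly as follows. This is only a convenience sketch; it leaves `time` as ISO-formatted strings, which already sort chronologically as text.

```python
import pandas as pd

# Modified sample data from above; row 2 is the early 'close' for id 2
df = pd.DataFrame({
    'id':    [1, 1, 2, 2, 3, 1, 3, 3, 2],
    'event': ['open', 'close', 'close', 'open', 'open',
              'close', 'open', 'close', 'close'],
    'time':  ['2022-07-05', '2021-05-05', '2022-04-04', '2022-05-05',
              '2019-07-12', '2022-06-05', '2018-07-12', '2018-08-12',
              '2023-05-05'],
})
print(df)
```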
You can use:
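# Sorting puts 'open' rows before 'close' rows, each ordered by time.
# keep_first drops any rows of an id that precede its first 'open',
# then keeps the first row per (id, event).
keep_first = lambda x: x[x['event'].eq('open').cumsum().gt(0)].drop_duplicates(['id', 'event'])
out = (df.sort_values(['event', 'time'], ascending=[False, True])
         .groupby('id').apply(keep_first).droplevel(0))
print(out)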
# Output
id event time
0 1 open 2022-07-05
1 1 close 2021-05-05
3 2 open 2022-05-05
2 2 close 2022-04-04
6 3 open 2018-07-12
7 3 close 2018-08-12
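Note that the output above still keeps the 2022-04-04 close for id 2, which is earlier than that id's open: the rows are ordered open-then-close, but the dates are not compared. If "close goes after open" is meant chronologically (an assumption about the intent, not something the question confirms), a sketch along these lines might be closer. The names `first_open`, `open_time` and `first_close` are only illustrative.

```python
import pandas as pd

# Same modified sample data as above
df = pd.DataFrame({
    'id':    [1, 1, 2, 2, 3, 1, 3, 3, 2],
    'event': ['open', 'close', 'close', 'open', 'open',
              'close', 'open', 'close', 'close'],
    'time':  ['2022-07-05', '2021-05-05', '2022-04-04', '2022-05-05',
              '2019-07-12', '2022-06-05', '2018-07-12', '2018-08-12',
              '2023-05-05'],
})
df['time'] = pd.to_datetime(df['time'])

# Earliest 'open' per id
first_open = (df[df['event'].eq('open')]
              .sort_values('time')
              .drop_duplicates('id'))

# For every row, the time of its id's earliest 'open' (NaT if the id has none)
open_time = df['id'].map(first_open.set_index('id')['time'])

# Earliest 'close' per id that is not before that id's earliest 'open'
first_close = (df[df['event'].eq('close') & df['time'].ge(open_time)]
               .sort_values('time')
               .drop_duplicates('id'))

out = pd.concat([first_open, first_close]).sort_values(['id', 'time'])
print(out)
```

Under this reading, id 1 keeps only its open row, because both of its close rows are dated before the open.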
CodePudding user response:
First sort by id and time, then extract open-close pairs per id:
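import pandas as pd
df['time'] = pd.to_datetime(df['time'])
# Sort by id, then by time with the newest rows first within each id
df = df.sort_values(['id','time'], ascending=[True, False])
# m1: an 'open' whose next row within the same id is a 'close'
m1 = df['event'].eq('open') & df.groupby('id')['event'].shift(-1).eq('close')
# m2: a 'close' whose previous row within the same id is an 'open'
m2 = df['event'].eq('close') & df.groupby('id')['event'].shift().eq('open')
df2 = df[m1 | m2]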
Then, if there are multiple pairs per id, remove duplicates:
df = df2.drop_duplicates(['id', 'event'])
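With the descending time sort used above, `shift(-1)` looks at the next older row, so each open is paired with a close dated before it. As a hedged variant, assuming the pairs should instead be chronological (open followed by a later close), the same mask idea can be run on an ascending sort. This sketch uses the question's original sample data; the names `s`, `nxt`, `prv` and `pairs` are illustrative.

```python
import pandas as pd

# Original sample data from the question
df = pd.DataFrame({
    'id':    [1, 1, 2, 3, 1, 3, 3, 2],
    'event': ['open', 'close', 'open', 'open',
              'close', 'open', 'close', 'close'],
    'time':  ['2022-07-05', '2021-05-05', '2022-05-05', '2019-07-12',
              '2022-06-05', '2018-07-12', '2018-08-12', '2023-05-05'],
})
df['time'] = pd.to_datetime(df['time'])

# Assumption: oldest rows first, so shift(-1) is the next later row per id
s = df.sort_values(['id', 'time'])
nxt = s.groupby('id')['event'].shift(-1)
prv = s.groupby('id')['event'].shift()

m1 = s['event'].eq('open') & nxt.eq('close')   # open directly followed by close
m2 = s['event'].eq('close') & prv.eq('open')   # close directly preceded by open

# Keep only the first open-close pair per id
pairs = s[m1 | m2].drop_duplicates(['id', 'event'])
print(pairs)
```

On this data, id 1 drops out because its only open is dated after both of its closes, which differs from the expected output shown in the question.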