Please find below my input/desired output :
INPUT
Id Status Date
0 Id001 off 01/01/2022
1 Id001 online 01/06/2022
2 Id002 off 05/12/2021
3 Id002 off 30/08/2021
4 Id003 running 02/03/2021
5 Id004 online 21/10/2021
6 Id004 running 01/04/2022
7 Id005 online 01/02/2022
8 Id005 online 02/07/2022
OUTPUT (desired)
Id Status Date
0 Id001 online 01/06/2022
1 Id002 off 05/12/2021
2 Id003 running 02/03/2021
3 Id004 online 21/10/2021
4 Id005 online 01/02/2022
The goal is firstly to have one line per Id in the output.
The output will be made based on a this simple statement :
if a group of Id has «online» as a Status then:
if «online» occurences > 1 then:
Choose the oldest Id that has «online»
else:
choose the Id that has «online»
else:
Choose the most recent Id
This is what I've tried so far :
import pandas as pd
import numpy as np
df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'],
'Status': ['off','online','off','off','running','online','running','online','online'],
'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']})
df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()
Do you have any suggestion/propositions, please ?
Any help we be so much appreciated !
CodePudding user response:
Answer is completely edited - first filter only online
rows, sorting by Date
and remove duplicates by first Id
:
df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id')
print (df1)
Id Status Date
5 Id004 online 2021-10-21
7 Id005 online 2022-02-01
1 Id001 online 2022-06-01
Then filter not matched Id
and sorting descending:
df2 =df[~df['Id'].isin(df1['Id'])].sort_values('Date',ascending=False).drop_duplicates('Id')
print (df2)
Id Status Date
2 Id002 off 2021-12-05
4 Id003 running 2021-03-02
Last join both Dataframes:
df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True)
print (df)
Id Status Date
0 Id001 online 2022-06-01
1 Id002 off 2021-12-05
2 Id003 running 2021-03-02
3 Id004 online 2021-10-21
4 Id005 online 2022-02-01
Original solution should be changed:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s'])
.agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax')))
df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index()
print (df1)
Id s Date_min Date_max
0 Id001 True 1 1
1 Id002 False 3 2
2 Id003 False 4 4
3 Id004 True 5 5
4 Id005 True 7 8
df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])]
print (df)
Id Status Date
1 Id001 online 2022-06-01
2 Id002 off 2021-12-05
4 Id003 running 2021-03-02
5 Id004 online 2021-10-21
7 Id005 online 2022-02-01