I would like to use this dataframe
df = pd.DataFrame({'Serial' : ['A1', 'A1', 'A1', 'B1','B1', 'B1'],'Day' : ['01.01.2022', '01.01.2022', '01.01.2021', '01.01.2019', '01.01.2019', '01.01.2020'],'Else' : ['a', 'b', 'c', 'd','e', 'f']})
to group by Serial and keep only the rows where Day equals the group's max(Day), i.e. here is my expected output:
Serial | Day | Else |
---|---|---|
A1 | 01.01.2022 | a |
A1 | 01.01.2022 | b |
B1 | 01.01.2020 | f |
I managed to compute the max, but I don't know how to use it as a filter to get the expected output.
df['Day']= pd.to_datetime(df['Day'], format="%d.%m.%Y")
df = df.groupby(['Serial'])['Day'].max()
CodePudding user response:
Here is one way to do it:
# parse the date strings into datetimes so that max compares dates, not text
df['Day2']=pd.to_datetime(df['Day'], dayfirst=True)
# group on Serial and broadcast each group's max date back onto every row of that group
# keep the rows whose date equals that max, then drop the helper column
out=df.loc[df['Day2'].eq(df.groupby('Serial')['Day2'].transform('max'))].drop(columns='Day2')
out
Serial Day Else
0 A1 01.01.2022 a
1 A1 01.01.2022 b
5 B1 01.01.2020 f
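If you would rather not add and drop a helper column, the same idea can be sketched with the parsed dates held in a separate Series. This is only a minimal variant of the approach above, not a different technique; the names `days` and `group_max` are my own and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Serial': ['A1', 'A1', 'A1', 'B1', 'B1', 'B1'],
    'Day': ['01.01.2022', '01.01.2022', '01.01.2021',
            '01.01.2019', '01.01.2019', '01.01.2020'],
    'Else': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Parse the dates into a separate Series (hypothetical name `days`),
# leaving the original string column in df untouched
days = pd.to_datetime(df['Day'], format='%d.%m.%Y')

# For every row, the latest date found within that row's Serial group
group_max = days.groupby(df['Serial']).transform('max')

# Keep the rows whose date equals the maximum of their group
out = df[days.eq(group_max)]
print(out)
```

Because `df` itself is never modified, the Day column in `out` keeps its original `dd.mm.yyyy` strings, and ties for the maximum are all kept.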
CodePudding user response:
Based on this answer, you should first build a boolean mask marking the rows where the date equals the maximum of its group. Then you can use that mask to index your dataframe. Something like this:
import pandas as pd

df = pd.DataFrame({'Serial' : ['A1', 'A1', 'A1', 'B1','B1', 'B1'],'Day' : ['01.01.2022', '01.01.2022', '01.01.2021', '01.01.2019', '01.01.2019', '01.01.2020'],'Else' : ['a', 'b', 'c', 'd','e', 'f']})
df['Day'] = pd.to_datetime(df['Day'], format="%d.%m.%Y")
# boolean mask: True where the row's Day equals the max Day of its Serial group
idx = df.groupby(['Serial'])['Day'].transform('max') == df['Day']
print(df[idx])
This gives the following result:
Serial Day Else
0 A1 2022-01-01 a
1 A1 2022-01-01 b
5 B1 2020-01-01 f
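Since the question already computes the per-group max with `groupby(...).max()`, another option is to use that aggregated result directly and merge it back onto the original rows. The following is a rough sketch of that alternative, not the method from the linked answer; the names `latest` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Serial': ['A1', 'A1', 'A1', 'B1', 'B1', 'B1'],
    'Day': ['01.01.2022', '01.01.2022', '01.01.2021',
            '01.01.2019', '01.01.2019', '01.01.2020'],
    'Else': ['a', 'b', 'c', 'd', 'e', 'f'],
})
df['Day'] = pd.to_datetime(df['Day'], format='%d.%m.%Y')

# One row per Serial holding its latest Day; as_index=False keeps
# Serial as a regular column so it can be used as a merge key
latest = df.groupby('Serial', as_index=False)['Day'].max()

# An inner merge on both columns keeps only the rows whose
# (Serial, Day) pair matches a group maximum
result = df.merge(latest, on=['Serial', 'Day'], how='inner')
print(result)
```

Note that the merge returns a fresh 0..n-1 index rather than the original row labels, so the mask approach above is the better fit if you need to keep the original index.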