My dataframe looks like this:
no  date                status  row_ant
1   11 Jan 2023, 07.00  ANT     1
1   11 Jan 2023, 09.00  AU      2
1   12 Jan 2023, 12.00  DLV     3
2   14 Jan 2023, 09.00  BA      1
2   14 Jan 2023, 10.00  AU      2
2   15 Jan 2023, 10.00  ANT     3
3   15 Jan 2023, 09.00  ANT     1
3   16 Jan 2023, 10.00  AU      2
3   16 Jan 2023, 10.00  ANT     3
Each no has at least 2 rows in the dataframe, which show its status history by timestamp. Each row has a timestamp, a status, and a row number.
What I want to do is create a summary that shows:
- the first date and the first status where status = ANT and row_ant = 1
- the last date and the last status whatever the value is
The desired result would be something like this:
no  date_ant            first_status  date_last_status    last_status
1   11 Jan 2023, 07.00  ANT           12 Jan 2023, 12.00  DLV
3   15 Jan 2023, 09.00  ANT           16 Jan 2023, 10.00  ANT
So far, I have only taken the minimum and the maximum date:
df = pd.DataFrame(df.groupby(['no','status','row_ant'])
.agg({'date':['min','max']})
.rename(columns={'min':'date_ant','max':'date_last_status'})).reset_index()
but this does not return the status for date_ant and date_last_status.
How should I write my script to get the desired dataframe in pandas? Thank you in advance.
CodePudding user response:
First convert the date column to datetimes with to_datetime and, if necessary, sort by both columns with DataFrame.sort_values. This step is the same in both solutions.
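As a minimal sketch of the conversion step on its own, here is how the format string maps onto the question's date text. The two-row `sample` frame is hypothetical and only there for illustration, and `%b` assumes English month abbreviations:

```python
import pandas as pd

# Hypothetical two-row sample using the date layout from the question
sample = pd.DataFrame({"date": ["11 Jan 2023, 09.00", "11 Jan 2023, 07.00"]})

# %d = day, %b = abbreviated month name, %Y = 4-digit year,
# %H.%M = hour and minute separated by a dot
sample["date"] = pd.to_datetime(sample["date"], format="%d %b %Y, %H.%M")

# After conversion the column sorts chronologically instead of as text
sample = sample.sort_values("date")
print(sample)
```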
Then select the last row per no with DataFrame.duplicated and DataFrame.loc, which filters by a boolean mask and a list of column names. For df2, filter by a mask built from the two conditions. Finally, combine both with DataFrame.merge:
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y, %H.%M')
df = df.sort_values(['no','date'])
df1 = df.loc[~df.duplicated('no', keep='last'), ['no','date', 'status']]
df2 = df.loc[df['status'].eq('ANT') & df['row_ant'].eq(1),['no','date', 'status']]
df = df2.merge(df1, how='left', suffixes=('_first','_last'), on='no')
print (df)
   no          date_first status_first           date_last status_last
0   1 2023-01-11 07:00:00          ANT 2023-01-12 12:00:00         DLV
1   3 2023-01-15 09:00:00          ANT 2023-01-16 10:00:00         ANT
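For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample data from the question and applies the same steps. The final rename to the column names from the desired result is my addition, and the variable names `first`, `last`, and `out` are just illustrative:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "no": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "date": ["11 Jan 2023, 07.00", "11 Jan 2023, 09.00", "12 Jan 2023, 12.00",
             "14 Jan 2023, 09.00", "14 Jan 2023, 10.00", "15 Jan 2023, 10.00",
             "15 Jan 2023, 09.00", "16 Jan 2023, 10.00", "16 Jan 2023, 10.00"],
    "status": ["ANT", "AU", "DLV", "BA", "AU", "ANT", "ANT", "AU", "ANT"],
    "row_ant": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

df["date"] = pd.to_datetime(df["date"], format="%d %b %Y, %H.%M")
df = df.sort_values(["no", "date"])

# Last row per "no": keep only rows that are not followed by another row with the same "no"
last = df.loc[~df.duplicated("no", keep="last"), ["no", "date", "status"]]

# First ANT row per "no": status must be ANT and row_ant must be 1
first = df.loc[df["status"].eq("ANT") & df["row_ant"].eq(1), ["no", "date", "status"]]

out = first.merge(last, how="left", on="no", suffixes=("_first", "_last"))

# Optional: rename to the column names used in the question's desired result
out = out.rename(columns={
    "date_first": "date_ant",
    "status_first": "first_status",
    "date_last": "date_last_status",
    "status_last": "last_status",
})
print(out)
```

Because the merge starts from the filtered ANT rows with how='left', any no without an ANT status at row_ant = 1 (here no 2) is dropped from the result.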
Or use DataFrameGroupBy.idxmax for the last date, after reversing the row order with DataFrame.iloc so that ties on the maximum date resolve to the last row:
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y, %H.%M')
df = df.sort_values(['no','date'])
df11 = df.iloc[::-1]
df1 = df11.loc[df11.groupby('no')['date'].idxmax(), ['no','date', 'status']]
df2 = df.loc[df['status'].eq('ANT') & df['row_ant'].eq(1),['no','date', 'status']]
df = df2.merge(df1, how='left', suffixes=('_first','_last'), on='no')
print (df)
   no          date_first status_first           date_last status_last
0   1 2023-01-11 07:00:00          ANT 2023-01-12 12:00:00         DLV
1   3 2023-01-15 09:00:00          ANT 2023-01-16 10:00:00         ANT
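The row reversal in the second solution matters when a group has two rows with the same maximum date, as no 3 does in the sample data. A small sketch of the difference, using only the three rows of no 3 (the names `ties`, `forward`, and `backward` are illustrative):

```python
import pandas as pd

# The three rows of no 3 from the question; the last two share the same timestamp
ties = pd.DataFrame({
    "no": [3, 3, 3],
    "date": pd.to_datetime(["2023-01-15 09:00", "2023-01-16 10:00", "2023-01-16 10:00"]),
    "status": ["ANT", "AU", "ANT"],
})

# idxmax returns the label of the first occurrence of the maximum,
# so on the original order it picks the earlier of the two tied rows
forward = ties.loc[ties.groupby("no")["date"].idxmax(), "status"].tolist()

# On the reversed frame the "first" occurrence is the last original row
rev = ties.iloc[::-1]
backward = rev.loc[rev.groupby("no")["date"].idxmax(), "status"].tolist()

print(forward)   # ['AU']
print(backward)  # ['ANT']
```

Without the reversal, the last status for no 3 would come out as AU rather than ANT.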