get first and last column with pandas with condition


My dataframe looks like as follow:

no        date            status   row_ant
1    11 Jan 2023, 07.00      ANT      1
1    11 Jan 2023, 09.00      AU       2
1    12 Jan 2023, 12.00      DLV      3
2    14 Jan 2023, 09.00      BA       1
2    14 Jan 2023, 10.00      AU       2
2    15 Jan 2023, 10.00      ANT      3 
3    15 Jan 2023, 09.00      ANT      1
3    16 Jan 2023, 10.00      AU       2
3    16 Jan 2023, 10.00      ANT      3 

Each no has at least 2 rows in the dataframe, which show the status history over time. Each row has a timestamp, a status, and a row number (row_ant).
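For reference, the sample dataframe above can be reconstructed like this (dates kept as strings, exactly as shown in the table):

```python
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'no':      [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'date':    ['11 Jan 2023, 07.00', '11 Jan 2023, 09.00', '12 Jan 2023, 12.00',
                '14 Jan 2023, 09.00', '14 Jan 2023, 10.00', '15 Jan 2023, 10.00',
                '15 Jan 2023, 09.00', '16 Jan 2023, 10.00', '16 Jan 2023, 10.00'],
    'status':  ['ANT', 'AU', 'DLV', 'BA', 'AU', 'ANT', 'ANT', 'AU', 'ANT'],
    'row_ant': [1, 2, 3, 1, 2, 3, 1, 2, 3],
})
```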

What I want to do is create a summary that shows:

  • the first date and the first status where status = ANT and row_ant = 1
  • the last date and the last status whatever the value is

The desired result would be something like this:

no         date_ant        first_status   date_last_status    last_status
1    11 Jan 2023, 07.00        ANT        12 Jan 2023, 12.00     DLV
3    15 Jan 2023, 09.00        ANT        16 Jan 2023, 10.00     ANT

So far, all I have managed to do is take the minimum and maximum date:

df = pd.DataFrame(df.groupby(['no', 'status', 'row_ant'])
                    .agg({'date': ['min', 'max']})
                    .rename(columns={'min': 'date_ant', 'max': 'date_last_status'})).reset_index()

but this does not return the status columns alongside date_ant and date_last_status.

How should I write my script to get the desired dataframe in pandas? Thank you in advance.

CodePudding user response:

First, convert the date column to datetimes with to_datetime and, if necessary, sort by both columns with DataFrame.sort_values; both solutions rely on this.

Then filter the last row per no with DataFrame.duplicated, using DataFrame.loc to apply the mask and select columns by a list of names. For df2, filter by a mask combining both conditions. Finally, use DataFrame.merge:

df['date'] = pd.to_datetime(df['date'], format='%d %b %Y, %H.%M')
df = df.sort_values(['no','date'])

df1 = df.loc[~df.duplicated('no', keep='last'), ['no','date', 'status']]
df2 = df.loc[df['status'].eq('ANT') & df['row_ant'].eq(1),['no','date', 'status']]

df = df2.merge(df1, how='left', suffixes=('_first','_last'), on='no')
print(df)
   no          date_first status_first           date_last status_last
0   1 2023-01-11 07:00:00          ANT 2023-01-12 12:00:00         DLV
1   3 2023-01-15 09:00:00          ANT 2023-01-16 10:00:00         ANT
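If the exact column names from the desired output are needed (date_ant, first_status, date_last_status, last_status), a final rename can be appended; this mapping is an assumption based on the names in the question, not part of the original answer:

```python
import pandas as pd

# Rebuild the sample data and run the merge approach from above
df = pd.DataFrame({
    'no':      [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'date':    ['11 Jan 2023, 07.00', '11 Jan 2023, 09.00', '12 Jan 2023, 12.00',
                '14 Jan 2023, 09.00', '14 Jan 2023, 10.00', '15 Jan 2023, 10.00',
                '15 Jan 2023, 09.00', '16 Jan 2023, 10.00', '16 Jan 2023, 10.00'],
    'status':  ['ANT', 'AU', 'DLV', 'BA', 'AU', 'ANT', 'ANT', 'AU', 'ANT'],
    'row_ant': [1, 2, 3, 1, 2, 3, 1, 2, 3],
})
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y, %H.%M')
df = df.sort_values(['no', 'date'])

df1 = df.loc[~df.duplicated('no', keep='last'), ['no', 'date', 'status']]
df2 = df.loc[df['status'].eq('ANT') & df['row_ant'].eq(1), ['no', 'date', 'status']]
out = df2.merge(df1, how='left', suffixes=('_first', '_last'), on='no')

# Rename the merged columns to match the desired output headers
out = out.rename(columns={'date_first': 'date_ant',
                          'status_first': 'first_status',
                          'date_last': 'date_last_status',
                          'status_last': 'last_status'})
```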

Or use DataFrameGroupBy.idxmax to locate the last date, first reversing the row order with DataFrame.iloc so that ties on date resolve to the last original row:

df['date'] = pd.to_datetime(df['date'], format='%d %b %Y, %H.%M')
df = df.sort_values(['no','date'])

df11 = df.iloc[::-1]
df1 = df11.loc[df11.groupby('no')['date'].idxmax(), ['no','date', 'status']]
df2 = df.loc[df['status'].eq('ANT') & df['row_ant'].eq(1),['no','date', 'status']]

df = df2.merge(df1, how='left', suffixes=('_first','_last'), on='no')
print(df)
   no          date_first status_first           date_last status_last
0   1 2023-01-11 07:00:00          ANT 2023-01-12 12:00:00         DLV
1   3 2023-01-15 09:00:00          ANT 2023-01-16 10:00:00         ANT
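One more equivalent variant, sketched here as an addition rather than part of the original answer: after sorting, the last row per no can also be taken with GroupBy.tail(1), which avoids both duplicated and idxmax:

```python
import pandas as pd

df = pd.DataFrame({
    'no':      [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'date':    ['11 Jan 2023, 07.00', '11 Jan 2023, 09.00', '12 Jan 2023, 12.00',
                '14 Jan 2023, 09.00', '14 Jan 2023, 10.00', '15 Jan 2023, 10.00',
                '15 Jan 2023, 09.00', '16 Jan 2023, 10.00', '16 Jan 2023, 10.00'],
    'status':  ['ANT', 'AU', 'DLV', 'BA', 'AU', 'ANT', 'ANT', 'AU', 'ANT'],
    'row_ant': [1, 2, 3, 1, 2, 3, 1, 2, 3],
})
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y, %H.%M')
df = df.sort_values(['no', 'date'])

# Last row per `no` after the sort; groupby preserves within-group order
last = df.groupby('no').tail(1)[['no', 'date', 'status']]
first = df.loc[df['status'].eq('ANT') & df['row_ant'].eq(1), ['no', 'date', 'status']]
res = first.merge(last, how='left', suffixes=('_first', '_last'), on='no')
```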