How to replace duplicate rows in pandas groupby with the more recent record

Time:07-01

Good Morning.

Let's say I have a table that looks like the following:

Column 1  Column 2  Selection Criteria (Date)
A         B         2022-05
A         B         2022-06
C         D         2022-05
C         D         2022-06
E         F         2022-05

I'd like to group this table by Column 1 and Column 2,

(i.e. df.groupby(['Column 1', 'Column 2']))

and then drop the duplicate rows that have the older date. Running this statement against the table above would yield:

Column 1  Column 2  Selection Criteria (Date)
A         B         2022-06
C         D         2022-06
E         F         2022-05

In general Column 1 and 2 make up the primary key of the table, and the selection criteria column is only included for the selection of records based on time.

Is anyone able to point me in the right direction?

CodePudding user response:

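df.sort_values('Selection Criteria (Date)').groupby(['Column 1', 'Column 2']).tail(1)

You can try this. Sorting by the date column puts the most recent row last within each group, and tail(1) keeps only that row.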
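For anyone who wants to run it end to end, here is a minimal sketch of the sort-then-tail approach on the table from the question. The variable names `date_col`, `keys`, and `latest` are my own, and the final sort and `reset_index` are only there to make the output tidy. It assumes the dates are `YYYY-MM` strings, which sort correctly as plain text.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Column 1": ["A", "A", "C", "C", "E"],
    "Column 2": ["B", "B", "D", "D", "F"],
    "Selection Criteria (Date)": ["2022-05", "2022-06", "2022-05", "2022-06", "2022-05"],
})

date_col = "Selection Criteria (Date)"  # illustrative helper names
keys = ["Column 1", "Column 2"]

# Sort so the most recent date comes last within each key pair,
# then keep only the final row of every group.
latest = (
    df.sort_values(date_col)
      .groupby(keys)
      .tail(1)
      .sort_values(keys)          # cosmetic: order the result by key
      .reset_index(drop=True)     # cosmetic: renumber rows from 0
)

print(latest)
```

Unlike an aggregation, `tail(1)` returns whole rows, so any extra columns in the frame stay matched to the row that was kept.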

CodePudding user response:

You can sort by the date and then use last:

df = df.sort_values('Selection Criteria (Date)').groupby(['Column 1', 'Column 2'], as_index=False).last()

print(df)

  Column 1 Column 2 Selection Criteria (Date)
0        A        B                   2022-06
1        C        D                   2022-06
2        E        F                   2022-05

Or convert the column to datetime and take the max:

df['Selection Criteria (Date)'] = pd.to_datetime(df['Selection Criteria (Date)'])
df = df.groupby(['Column 1', 'Column 2'], as_index=False).max()

print(df)

  Column 1 Column 2 Selection Criteria (Date)
0        A        B                2022-06-01
1        C        D                2022-06-01
2        E        F                2022-05-01
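
One caveat with the aggregation route: `max()` is computed for each column independently, so it is only safe when the date is the sole non-key column, as in the question. If the real table has more columns, a row-preserving variant may be a better fit. The sketch below shows two such options that are not from the answers above: `drop_duplicates(keep="last")` after sorting, and `idxmax` to look up the newest row per group. The `Value` column and the variable names are hypothetical, added only to show that the other columns stay with the right row.

```python
import pandas as pd

df = pd.DataFrame({
    "Column 1": ["A", "A", "C", "C", "E"],
    "Column 2": ["B", "B", "D", "D", "F"],
    "Selection Criteria (Date)": ["2022-05", "2022-06", "2022-05", "2022-06", "2022-05"],
    "Value": [1, 2, 3, 4, 5],  # hypothetical extra column, not in the question
})

date_col = "Selection Criteria (Date)"
keys = ["Column 1", "Column 2"]

# Parse the YYYY-MM strings so comparisons are done on real dates.
df[date_col] = pd.to_datetime(df[date_col], format="%Y-%m")

# Option 1: sort by date, then keep the last row seen for each key pair.
by_dedup = (
    df.sort_values(date_col)
      .drop_duplicates(subset=keys, keep="last")
      .sort_values(keys)
      .reset_index(drop=True)
)

# Option 2: find the index label of the newest row in each group,
# then select those rows from the original frame.
newest_idx = df.groupby(keys)[date_col].idxmax()
by_idxmax = df.loc[newest_idx].reset_index(drop=True)

print(by_dedup)
print(by_idxmax)
```

Both options keep the complete row for the most recent date, so `Value` comes from the same record as the date it is shown with.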