Pandas delete duplicate rows based on timestamp


I have a dataset with multiple duplicate records, distinguished only by timestamp, for the same date. For a given ID and date combo, I want to keep the record with the max timestamp and delete the other records.

Sample dataset

id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.259+0000|xyz
1|2022-04-19T18:46:36.302+0000|xyz
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.871+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.829+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz

Desired final df

id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz

CodePudding user response:

A combination of .groupby and .max will do:

import pandas as pd

# Toy data: one id, with duplicate dates among the timestamps
dates = pd.to_datetime(['01-01-1990', '01-02-1990', '01-02-1990', '01-03-1990'])
values = [1] * len(dates)
ids = values[:]

df = pd.DataFrame(zip(dates, values, ids), columns=['timestamp', 'val', 'id'])

# Keep only the maximum timestamp within each (val, id) group
selection = df.groupby(['val', 'id'])['timestamp'].max().reset_index()

print(selection)

output

   val  id  timestamp
0    1   1 1990-01-03

CodePudding user response:

You can use the following code for your task:

df.groupby(["id","value"]).max() 

Explanation: first group by the id and value columns, then select only the maximum timestamp in each group.
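For completeness, a runnable sketch of this one-liner on data shaped like the question's sample (timestamps kept as strings, which compare correctly lexicographically within one ISO-8601 format; I add as_index=False only to keep the grouping keys as ordinary columns):

```python
import pandas as pd

# Data shaped like the question's sample: one id, one value, one date
df = pd.DataFrame({
    "id": [1, 1, 1],
    "value": ["xyz", "xyz", "xyz"],
    "timestamp": [
        "2022-04-19T18:46:36.259+0000",
        "2022-04-19T18:46:36.302+0000",
        "2022-04-19T18:46:36.357+0000",
    ],
})

# One row per (id, value) pair with the maximum timestamp
result = df.groupby(["id", "value"], as_index=False).max()
print(result)
```

Note that this keeps one row per (id, value) pair overall, so with several dates per id it collapses them all; the per-date grouping in the other answers handles that case.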

CodePudding user response:

If you post the data as code (or at least as text) rather than an image, it'll be easier to share a reproducible result.

# To keep only the latest timestamp for each date:
# create a date-only field from timestamp to identify the duplicates,
# sort values so the latest timestamp for an id comes last,
# drop duplicates based on id and date, keeping the last row,
# finally drop the temp column

(df.assign(d=pd.to_datetime(df['timestamp']).dt.date)
 .sort_values(['id','timestamp'])
 .drop_duplicates(subset=['id','d'], keep='last')
 .drop(columns='d')
)

    id  timestamp   value
2   1   2022-04-19T18:46:36.357+0000    xyz
4   1   2022-04-24T00:41:40.879+0000    xyz
6   1   2022-05-02T10:15:25.832+0000    xyz