I have a dataset with multiple records per ID on the same date, differing only in their timestamps. For each ID and date combination, I want to keep the record with the max timestamp and delete the others.
Sample dataset
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.259 0000|xyz
1|2022-04-19T18:46:36.302 0000|xyz
1|2022-04-19T18:46:36.357 0000|xyz
1|2022-04-24T00:41:40.871 0000|xyz
1|2022-04-24T00:41:40.879 0000|xyz
1|2022-05-02T10:15:25.829 0000|xyz
1|2022-05-02T10:15:25.832 0000|xyz
Desired final DataFrame
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.357 0000|xyz
1|2022-04-24T00:41:40.879 0000|xyz
1|2022-05-02T10:15:25.832 0000|xyz
CodePudding user response:
A combination of .groupby and .max will do:
import pandas as pd
# Toy data: four timestamps (one repeated) that all share the same val and id
dates = pd.to_datetime(['01-01-1990', '01-02-1990', '01-02-1990', '01-03-1990'])
values = [1] * len(dates)
ids = values[:]
df = pd.DataFrame(zip(dates, values, ids), columns=['timestamp', 'val', 'id'])
# Latest timestamp within each (val, id) group
selection = df.groupby(['val', 'id'])['timestamp'].max().reset_index()
print(selection)
Output:
val id timestamp
0 1 1 1990-01-03
CodePudding user response:
You can use the following code for your task:
df.groupby(["id","value"]).max()
Explanation: first group by the id and value columns, then take the maximum of the remaining column (timestamp) within each group.
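Grouping by id and value alone returns a single row per group, so for the sample data it would keep only the latest timestamp overall. To keep one row per date, the date probably needs to be part of the grouping key. The following is a minimal sketch of that idea, not a definitive solution. It assumes the timestamps carry a `+0000` offset (the plus sign looks to be missing in the sample above), and the names `ts`, `keep` and `result` are only illustrative.

```python
import pandas as pd

# Sample rows from the question; the "+0000" offset is an assumption
df = pd.DataFrame({
    "id": [1] * 7,
    "timestamp": [
        "2022-04-19T18:46:36.259+0000",
        "2022-04-19T18:46:36.302+0000",
        "2022-04-19T18:46:36.357+0000",
        "2022-04-24T00:41:40.871+0000",
        "2022-04-24T00:41:40.879+0000",
        "2022-05-02T10:15:25.829+0000",
        "2022-05-02T10:15:25.832+0000",
    ],
    "value": ["xyz"] * 7,
})

# Parse the strings so that comparisons are done on real datetimes
ts = pd.to_datetime(df["timestamp"], utc=True)

# For each (id, calendar date) group, find the row label of the max timestamp
keep = ts.groupby([df["id"], ts.dt.date]).idxmax()

# Select those rows from the original frame, keeping all original columns
result = df.loc[keep.to_numpy()].reset_index(drop=True)
print(result)
```

Using idxmax instead of max returns row labels, so the other columns of the winning row come along unchanged instead of being aggregated separately.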
CodePudding user response:
If you post the data as code or text, it is easier to share a result. Since you already have the data, posting it that way is simple.
# Keep the latest timestamp for each id and date:
# create a date-only field from timestamp to identify the duplicates
# sort values so the latest timestamp for an id comes last
# drop duplicates based on id and date, keeping the last row
# finally drop the temp column
(df.assign(d=pd.to_datetime(df['timestamp']).dt.date)
.sort_values(['id','timestamp'])
.drop_duplicates(subset=['id','d'], keep='last')
.drop(columns='d')
)
id timestamp value
2 1 2022-04-19T18:46:36.357 0000 xyz
4 1 2022-04-24T00:41:40.879 0000 xyz
6 1 2022-05-02T10:15:25.832 0000 xyz
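For anyone who wants to run the steps above end to end, here is a self-contained sketch of the same sort-then-drop_duplicates idea. It makes two assumptions that are not in the original post: the timestamps have a `+0000` offset, and the rows arrive in arbitrary order (they are shuffled here on purpose to show why the sort matters). The helper columns `ts` and `d` are temporary and hypothetical names.

```python
import pandas as pd

# Same records as the question, deliberately out of order; "+0000" is assumed
df = pd.DataFrame({
    "id": [1] * 7,
    "timestamp": [
        "2022-04-24T00:41:40.879+0000",
        "2022-04-19T18:46:36.302+0000",
        "2022-05-02T10:15:25.829+0000",
        "2022-04-19T18:46:36.357+0000",
        "2022-04-24T00:41:40.871+0000",
        "2022-05-02T10:15:25.832+0000",
        "2022-04-19T18:46:36.259+0000",
    ],
    "value": ["xyz"] * 7,
})

result = (
    df.assign(ts=pd.to_datetime(df["timestamp"], utc=True))  # parsed timestamps
      .assign(d=lambda x: x["ts"].dt.date)                   # date-only key
      .sort_values(["id", "ts"])                             # latest timestamp last
      .drop_duplicates(subset=["id", "d"], keep="last")      # one row per id and date
      .drop(columns=["ts", "d"])                             # remove helper columns
)
print(result)
```

Sorting on the parsed `ts` column rather than on the raw strings avoids relying on the strings happening to sort chronologically.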