I have this df:
                 time
0 2022-07-28 13:35:00
1 2022-07-28 13:35:15
2 2022-07-28 13:35:15
3 2022-07-28 13:35:30
4 2022-07-28 13:35:30
I need to remove the first occurrence of each duplicated time value, so that the final result looks like this:
                 time
0 2022-07-28 13:35:00
2 2022-07-28 13:35:15
4 2022-07-28 13:35:30
Sample code:
import pandas as pd

tbl2 = {"time": ["2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15",
                 "2022-07-28 13:35:30", "2022-07-28 13:35:30"]}
df2 = pd.DataFrame(tbl2)
# Convert the strings to datetime64 values
df2["time"] = pd.to_datetime(df2["time"])
CodePudding user response:
The drop_duplicates solution in the other answer works fine as long as each time occurs at most twice (i.e. there is at most one duplicate per value). If a time can occur more than twice and you want to remove only its first occurrence, you can use the approach below:
# Make sure that time data is sorted
df2.sort_values(by="time", inplace=True)
# Add a rownumber per time group and a count of rows per group
df2["time_rownum"] = df2.groupby("time").cumcount() + 1
df2["time_count"] = df2.groupby("time")["time_rownum"].transform('max')
# Filter
mask = (df2["time_count"] == 1) | ((df2["time_rownum"] > 1) & (df2["time_count"] > 1))
df2 = df2.loc[mask, ["time"]]
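The same filter can probably also be expressed without helper columns by combining two calls to DataFrame.duplicated. The following is only a sketch under assumptions: the frame df3 and its sample values are made up for illustration (one time occurs three times), and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: 13:35:15 occurs three times, 13:35:30 twice
df3 = pd.DataFrame({"time": pd.to_datetime([
    "2022-07-28 13:35:00",
    "2022-07-28 13:35:15",
    "2022-07-28 13:35:15",
    "2022-07-28 13:35:15",
    "2022-07-28 13:35:30",
    "2022-07-28 13:35:30",
])})

# True for every row whose time appears more than once
in_dup_group = df3.duplicated("time", keep=False)
# True for every occurrence except the first one of each time
not_first = df3.duplicated("time", keep="first")

# Drop a row only if it is the first occurrence of a duplicated time
first_of_dup_group = in_dup_group & ~not_first
result = df3[~first_of_dup_group]
print(result)
```

Rows 0, 2, 3 and 5 remain: the unique time is kept, and each duplicated time loses only its first row.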
CodePudding user response:
You can use drop_duplicates and keep the last value.
import pandas as pd

df = pd.DataFrame({"time": ["2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15",
                            "2022-07-28 13:35:30", "2022-07-28 13:35:30"]})
df.time = pd.to_datetime(df.time)
# Keep only the last row of each group of identical rows
df.drop_duplicates(keep='last', inplace=True)
print(df)
Output:
                 time
0 2022-07-28 13:35:00
2 2022-07-28 13:35:15
4 2022-07-28 13:35:30
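If the DataFrame has more columns than just time, drop_duplicates compares whole rows by default, so rows with the same time but different other values would not be removed. A minimal sketch of restricting the comparison to the time column with the subset parameter follows; the frame df4 and its "value" column are assumptions added purely for illustration.

```python
import pandas as pd

# Hypothetical frame: the question's times plus an assumed "value" column
df4 = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-07-28 13:35:00",
        "2022-07-28 13:35:15",
        "2022-07-28 13:35:15",
        "2022-07-28 13:35:30",
        "2022-07-28 13:35:30",
    ]),
    "value": [1, 2, 3, 4, 5],
})

# Compare only the "time" column and keep the last row for each time
deduped = df4.drop_duplicates(subset="time", keep="last")
print(deduped)
```

This version returns a new DataFrame instead of modifying df4 in place, which makes it easier to compare the result with the original.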