How to delete first date duplicate

Time:08-01

I have this df:

             time
0 2022-07-28 13:35:00
1 2022-07-28 13:35:15
2 2022-07-28 13:35:15
3 2022-07-28 13:35:30
4 2022-07-28 13:35:30

I need to remove the first occurrence of each duplicated time, so that the final result looks like this:

             time
0 2022-07-28 13:35:00

2 2022-07-28 13:35:15

4 2022-07-28 13:35:30

Sample code:

import pandas as pd

tbl2 = {"time": ["2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15",
                 "2022-07-28 13:35:30", "2022-07-28 13:35:30"]}

df2 = pd.DataFrame(tbl2)
df2.time = pd.to_datetime(df2.time)

CodePudding user response:

Using drop_duplicates with keep='last' works fine as long as each time has at most one duplicate (i.e. each time occurs at most twice). However, if a time can occur more often and only its first occurrence should be removed, you can use the approach below:

# Make sure that time data is sorted
df2.sort_values(by="time", inplace=True)

# Add a rownumber per time group and a count of rows per group
df2["time_rownum"] = df2.groupby("time").cumcount() + 1
df2["time_count"] = df2.groupby("time")["time_rownum"].transform('max')

# Filter
mask = (df2["time_count"] == 1) | ((df2["time_rownum"] > 1) & (df2["time_count"] > 1))
df2 = df2.loc[mask, ["time"]]
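The same filter can probably be written without helper columns. The following is a minimal sketch, not the answerer's code: it assumes DataFrame.duplicated is acceptable in place of the groupby counters, and the function name drop_first_occurrence and the demo frame are made up for illustration. The demo contains a time that occurs three times, to show that only its first row is dropped.

```python
import pandas as pd


def drop_first_occurrence(df: pd.DataFrame, col: str = "time") -> pd.DataFrame:
    """Drop the first row of every duplicated group; rows with a unique value stay.

    Hypothetical helper, named here only for illustration.
    """
    # True for the first occurrence of each value in `col`
    is_first = ~df.duplicated(subset=col, keep="first")
    # True for every row whose value occurs more than once
    has_dupes = df.duplicated(subset=col, keep=False)
    # Remove only rows that are both "first" and part of a duplicated group
    return df.loc[~(is_first & has_dupes)]


# Demo data (assumed): 13:35:15 occurs three times, 13:35:30 twice
demo = pd.DataFrame({"time": pd.to_datetime([
    "2022-07-28 13:35:00",
    "2022-07-28 13:35:15",
    "2022-07-28 13:35:15",
    "2022-07-28 13:35:15",
    "2022-07-28 13:35:30",
    "2022-07-28 13:35:30",
])})

result = drop_first_occurrence(demo)
print(result)  # rows 0, 2, 3 and 5 remain
```

Unlike the groupby version, this sketch relies on the existing row order to decide which occurrence is "first", so sort by time beforehand if the rows may be out of order.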

CodePudding user response:

You can use drop_duplicates and keep the last value.

import pandas as pd

df = pd.DataFrame({"time" :["2022-07-28 13:35:00", "2022-07-28 13:35:15", "2022-07-28 13:35:15", "2022-07-28 13:35:30", "2022-07-28 13:35:30"]})
df.time = pd.to_datetime(df.time)
df.drop_duplicates(keep='last', inplace=True)
print(df)

Output:

                 time
0 2022-07-28 13:35:00
2 2022-07-28 13:35:15
4 2022-07-28 13:35:30
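One caveat worth sketching: without a subset argument, drop_duplicates compares entire rows, so it only behaves as shown when time is the only column. The example below assumes a hypothetical extra column named value (not part of the question) to illustrate the difference.

```python
import pandas as pd

# Same times as in the question, plus a hypothetical "value" column
df3 = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-07-28 13:35:00",
        "2022-07-28 13:35:15",
        "2022-07-28 13:35:15",
        "2022-07-28 13:35:30",
        "2022-07-28 13:35:30",
    ]),
    "value": [1, 2, 3, 4, 5],
})

# Compares whole rows: every (time, value) pair is unique, so nothing is dropped
all_cols = df3.drop_duplicates(keep="last")

# Compares only the time column: the last row of each time is kept
by_time = df3.drop_duplicates(subset="time", keep="last")

print(len(all_cols))
print(by_time)
```

Passing subset="time" keeps the result tied to the time column even if more columns are added later.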