I have the following to start with:
id key serial flag time
0 310000 first 100 1 2022-01-15 07:28:00.000
1 310000 second 100 1 2022-01-15 07:29:00.000
2 310000 second 100 1 2022-01-15 07:30:00.000
3 450000 first 200 1 2022-01-15 07:45:00.000
4 450000 second 200 1 2022-01-15 07:46:00.000
5 450000 third 200 1 2022-01-15 07:47:00.000
6 450000 third 200 1 2022-01-15 07:48:00.000
I want to keep at least ONE row for each KEY value and group them so that any duplicates are dropped. If possible, keep the row with the earliest time.
Ideal outcome:
id key serial flag time
310000 first 100 1 2022-01-15 07:28:00.000
310000 second 100 1 2022-01-15 07:29:00.000
450000 first 200 1 2022-01-15 07:45:00.000
450000 second 200 1 2022-01-15 07:46:00.000
450000 third 200 1 2022-01-15 07:47:00.000
The query I have so far is:
df.sort_values(by=['id', 'key', 'serial', 'flag'], ascending=[True, True, False, False]).drop_duplicates(subset=['id', 'key', 'serial', 'flag'])
but it doesn't quite give what I want. Any ideas?
CodePudding user response:
You should sort by time first. drop_duplicates keeps the first occurrence in each duplicate group by default, so sorting ascending by time before dropping keeps the earliest row.
df.sort_values("time").drop_duplicates(subset=["id", "key", "serial", "flag"])
id key serial flag time
0 310000 first 100 1 2022-01-15 07:28:00
1 310000 second 100 1 2022-01-15 07:29:00
3 450000 first 200 1 2022-01-15 07:45:00
4 450000 second 200 1 2022-01-15 07:46:00
5 450000 third 200 1 2022-01-15 07:47:00
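For completeness, here is a self-contained sketch of the answer above, rebuilding the sample frame from the question so the whole thing can be run end to end:

```python
import pandas as pd

# Recreate the sample frame from the question.
df = pd.DataFrame(
    {
        "id": [310000] * 3 + [450000] * 4,
        "key": ["first", "second", "second", "first", "second", "third", "third"],
        "serial": [100, 100, 100, 200, 200, 200, 200],
        "flag": [1] * 7,
        "time": pd.to_datetime(
            [
                "2022-01-15 07:28:00",
                "2022-01-15 07:29:00",
                "2022-01-15 07:30:00",
                "2022-01-15 07:45:00",
                "2022-01-15 07:46:00",
                "2022-01-15 07:47:00",
                "2022-01-15 07:48:00",
            ]
        ),
    }
)

# Sort ascending by time, then keep only the first occurrence of each
# (id, key, serial, flag) combination -- i.e. the earliest row.
result = df.sort_values("time").drop_duplicates(
    subset=["id", "key", "serial", "flag"]
)
print(result)
```

This drops rows 2 and 6 (the later duplicates) and leaves rows 0, 1, 3, 4, 5, matching the ideal outcome. If you ever need the latest row instead, pass `keep="last"` to `drop_duplicates` (or sort descending).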