Home > Software design >  Group by given columns and keep atleast 1 value for each column using pandas dataframe
Group by given columns and keep atleast 1 value for each column using pandas dataframe

Time:08-24

I have the following to start with:

   id       key      serial   flag   time
0  310000   first    100      1      2022-01-15 07:28:00.000
1  310000   second   100      1      2022-01-15 07:29:00.000
2  310000   second   100      1      2022-01-15 07:30:00.000
3  450000   first    200      1      2022-01-15 07:45:00.000
4  450000   second   200      1      2022-01-15 07:46:00.000
5  450000   third    200      1      2022-01-15 07:47:00.000
6  450000   third    200      1      2022-01-15 07:48:00.000

I want to keep atleast ONE of each KEY values and group them so that I drop any duplicates. If possible, keep the one that was with the ascending time column.

Ideal outcome:

id       key      serial   flag   time
310000   first    100      1      2022-01-15 07:28:00.000
310000   second   100      1      2022-01-15 07:29:00.000  
450000   first    200      1      2022-01-15 07:45:00.000  
450000   second   200      1      2022-01-15 07:46:00.000  
450000   third    200      1      2022-01-15 07:47:00.000  

The query i have so far is:

df.sort_values(by = ['id', 'key', 'serial', 'flag'],ascending = [True, True, False, False]).drop_duplicates(subset = ['id', 'key', 'serial', 'flag'])

but it doesn't quite give what I want. Any ideas?

CodePudding user response:

You should sort by time first.

df.sort_values("time").drop_duplicates(subset=["id", "key", "serial", "flag"])
       id     key  serial  flag                time
0  310000   first     100     1 2022-01-15 07:28:00
1  310000  second     100     1 2022-01-15 07:29:00
3  450000   first     200     1 2022-01-15 07:45:00
4  450000  second     200     1 2022-01-15 07:46:00
5  450000   third     200     1 2022-01-15 07:47:00
  • Related