I have a dataframe like this:
import pandas as pd
data1 = {
"siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
"date": [42, 30, 43, 29, 26, 34, 10, 14],
}
df = pd.DataFrame(data1)
But I want to delete any duplicates in siteID, keeping only the row with the most up-to-date (highest) date value, AND keeping every 'nan' row.
I get close with this code:
df_no_dup = df.sort_values('date').drop_duplicates('siteID', keep='last')
which keeps, for each siteID, only the row with the highest date value. The issue is that the 'nan' rows get deduplicated too, so most of them are removed, when I want them all left alone. Is there any way to keep every row where siteID is equal to 'nan'?
Expected output:
siteID date
nan 10
nan 14
2 30
nan 34
1 42
3 43
CodePudding user response:
I would use df.duplicated to create a custom condition, like this:
# Mark rows that are (a) not the latest entry for their siteID and (b) not 'nan'
mask = df.sort_values('date').duplicated('siteID', keep='last') & (df['siteID'] != 'nan')
df.drop(df[mask].index)
Result
siteID date
0 1 42
1 2 30
2 3 43
5 nan 34
6 nan 10
7 nan 14
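For comparison, here is a sketch of an alternative (not from the answer above) that builds the keep-condition directly with groupby/transform instead of sorting and dropping: keep a row if its siteID is 'nan', or if its date equals the maximum date within its siteID group.

```python
import pandas as pd

df = pd.DataFrame({
    "siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
    "date": [42, 30, 43, 29, 26, 34, 10, 14],
})

# Keep every 'nan' row, plus the row with the highest date per siteID.
is_nan = df['siteID'] == 'nan'
latest = df['date'] == df.groupby('siteID')['date'].transform('max')
result = df[is_nan | latest]
```

This avoids re-sorting the frame and preserves the original row order, which matches the answer's result.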