I have a dataframe like this:
import pandas as pd
data1 = {
"siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
"date": [42, 30, 43, 29, 26, 34, 10, 14],
}
df = pd.DataFrame(data1)
But I want to delete any duplicates in siteID, keeping only the row with the most up-to-date (highest) date value, AND keeping every 'nan' row.
I get close with this code:
df_no_dup = df.sort_values('date').drop_duplicates('siteID', keep='last')
which keeps, for each siteID, only the row with the highest date value. The issue is that the 'nan' rows get deduplicated too, so most of them are removed, when I want them all left alone. Is there any way to keep every row where siteID is equal to 'nan'?
Expected output:
siteID date
nan 10
nan 14
2 30
nan 34
1 42
3 43
CodePudding user response:
I would use df.duplicated to create a custom condition, like this:
# Mark rows that are (a) not the latest entry for their siteID and (b) not 'nan'
mask = df.sort_values('date').duplicated('siteID', keep='last') & (df['siteID'] != 'nan')
df.drop(df[mask].index)
Result
siteID date
0 1 42
1 2 30
2 3 43
5 nan 34
6 nan 10
7 nan 14
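For comparison, here is a sketch of an alternative (not from the answer above) that builds the keep-condition directly with groupby/transform instead of sorting and dropping: keep a row if its siteID is 'nan', or if its date equals the maximum date within its siteID group.

```python
import pandas as pd

df = pd.DataFrame({
    "siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
    "date": [42, 30, 43, 29, 26, 34, 10, 14],
})

# Keep every 'nan' row, plus the row with the highest date per siteID.
is_nan = df['siteID'] == 'nan'
latest = df['date'] == df.groupby('siteID')['date'].transform('max')
result = df[is_nan | latest]
```

This avoids re-sorting the frame and preserves the original row order, which matches the answer's result.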