I have some duplicate rows in my dataframe, the first occurrence has Nan value in specific column and the second occurrence has that value in the sma ecolumn. I want to remove duplicate, keep the fisrt occurrence and replace Nan value with value from second occurrence. Like this:
Name | Rank | City |
---|---|---|
Andre Ryan | NaN | London |
Andre Ryan | 86 | Paris |
... | ...- | ... |
The goal
Name | Rank | City |
---|---|---|
Andre Ryan | 86 | London |
... | ... | ... |
CodePudding user response:
Here's one way to dedup by using groupby
and agg
on this example dataset:
import pandas as pd
#create the test table
df = pd.DataFrame({
'Name':['A','A','B','C','B','C'],
'Rank':[None,1,None,None,2,3],
'City':['q','r','s','t','u','v'],
})
#groupby and agg to dedup
dedup_df = df.groupby('Name').agg(
Rank = ('Rank',lambda ranks: ranks.dropna().iloc[0]),
City = ('City','first'),
).reset_index()
dedup_df
Output
CodePudding user response:
You can try groupby.bfill
then drop_duplicates
out = (df.assign(**df.groupby('Name').bfill())
.drop_duplicates('Name', keep='first'))
print(out)
Name Rank City
0 Andre Ryan 86.0 London