pandas drop_duplicates condition on two other columns values

Time:05-24

I have a DataFrame with columns A, B and C. Column A contains duplicates. Column B holds either an email address or NaN. Column C holds either the string 'wait' or a number. For each duplicated value in A, I would like to keep the non-NaN value in B and the non-'wait' value in C (i.e. the number). How can I do that on a DataFrame df? I have tried df.drop_duplicates('A'), but I don't see any way to put conditions on the other columns.

Edit: sample data:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': ['[email protected]', np.nan, np.nan, '[email protected]', 'np.nan', np.nan], 'C': [123, 456, 567, 'wait', 'wait', 'wait']})
>>> df
   A        B     C
0  1  [email protected]   123
1  1      NaN   456
2  2      NaN   567
3  2  [email protected]  wait
4  3   np.nan  wait
5  3      NaN  wait

I would like the resulting DataFrame to be:

>>> df
   A        B     C
0  1  [email protected]   123
1  2  [email protected]   567
2  3   np.nan  wait

Thank you. Best,

CodePudding user response:

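Solution: sort the rows so that those where C equals 'wait' come last, then group by column A and take the first non-missing value of each column in every group. Because first() skips NaN, B and C can come from different rows of the same group: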

df = df.sort_values(['A', 'C'], key = lambda x: x.eq('wait')).groupby('A').first()
print (df)
         B     C
A               
1  [email protected]   123
2  [email protected]   567
3   np.nan  wait
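
A note on the one-liner above: the key function passed to sort_values is applied to each sort column separately, so the key for A is all False and the rows end up ordered only by whether C equals 'wait'. The groupby('A') step is what brings the rows back together per A value. Below is a minimal sketch of a more explicit variant, not a definitive implementation. It assumes a hypothetical helper column named _is_wait and uses placeholder addresses (a@example.com, b@example.com) in place of the redacted emails from the sample data.

```python
import numpy as np
import pandas as pd

# Placeholder addresses stand in for the redacted emails in the question.
df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3],
    'B': ['a@example.com', np.nan, np.nan, 'b@example.com', 'np.nan', np.nan],
    'C': [123, 456, 567, 'wait', 'wait', 'wait'],
})

result = (
    # _is_wait is a hypothetical helper column: True where C is the string 'wait'.
    df.assign(_is_wait=df['C'].eq('wait'))
      # Within each A, rows with a number in C (False) sort before 'wait' rows (True).
      .sort_values(['A', '_is_wait'])
      .drop(columns='_is_wait')
      # first() takes the first non-missing value of each column per group,
      # so B and C may come from different rows of the same group.
      .groupby('A', as_index=False)
      .first()
)
print(result)
```

Using as_index=False keeps A as a regular column, which matches the layout asked for in the question. Note that the string 'np.nan' in row 4 of the sample data is not a missing value, so it is kept for A == 3.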
    