I have the dataframe below:
No: Fee:
111 500
111 500
222 300
222 300
123 400
If a value in No is duplicated, I want to keep the fee only on the first row and blank it on the other rows. It should look like this:
No: Fee:
111 500
111
222 300
222
123 400
I have no idea where to start, so any guidance is appreciated.
Thanks.
Answer:
Use DataFrame.duplicated to build a boolean mask of the repeated rows, then set Fee to an empty string on those rows with DataFrame.loc:
# flag rows that repeat an earlier (No, Fee) pair;
# use df.duplicated('No') instead to compare on No only
mask = df.duplicated(['No', 'Fee'])
df.loc[mask, 'Fee'] = ''
print(df)
No Fee
0 111 500
1 111
2 222 300
3 222
4 123 400
But the Fee column is then no longer numeric, because it mixes numbers with strings:
print(df['Fee'].dtype)
object
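The snippets above assume df already exists. As a minimal self-contained sketch, the version below rebuilds the sample data from the question and compares on No only, which is what the question literally asks for. Casting Fee to object before the assignment is my own addition, not part of the answer above: it makes the switch to a mixed column explicit, and newer pandas versions may warn when a string is written into an integer column.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'No': [111, 111, 222, 222, 123],
                   'Fee': [500, 500, 300, 300, 400]})

# True for each row whose No already appeared on an earlier row;
# the first occurrence of every No stays False.
mask = df.duplicated('No')

# Assumption: cast to object first so the column can hold both
# integers and strings without an implicit dtype change.
df['Fee'] = df['Fee'].astype(object)

# Blank out the fee on the repeated rows only.
df.loc[mask, 'Fee'] = ''

print(df)
print(df['Fee'].dtype)
```

With this sample data both masks flag the same rows, because rows that share a No also share a Fee. They would differ if the same No appeared with different fees.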
If the column must stay numeric, one option is to use missing values instead of empty strings:
import numpy as np
df.loc[mask, 'Fee'] = np.nan
print(df)
No Fee
0 111 500.0
1 111 NaN
2 222 300.0
3 222 NaN
4 123 400.0
print(df['Fee'].dtype)
float64
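Or, to keep integers rather than floats, set the missing values and then convert the column to the nullable integer dtype Int64:
df.loc[mask, 'Fee'] = np.nan
df['Fee'] = df['Fee'].astype('Int64')
print(df)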
No Fee
0 111 500
1 111 <NA>
2 222 300
3 222 <NA>
4 123 400
print(df['Fee'].dtype)
Int64
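For reference, here is a self-contained sketch of the nullable integer approach, starting from the sample data in the question. It differs slightly from the steps above: it converts Fee to Int64 first and then assigns pd.NA, so the values never pass through float64. That ordering is my own variation, and the mask here compares on No only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'No': [111, 111, 222, 222, 123],
                   'Fee': [500, 500, 300, 300, 400]})

# True for each row whose No already appeared on an earlier row.
mask = df.duplicated('No')

# Convert to the nullable integer dtype before inserting missing values,
# so the column stays integer-typed throughout.
df['Fee'] = df['Fee'].astype('Int64')

# pd.NA is the missing-value marker used by nullable dtypes.
df.loc[mask, 'Fee'] = pd.NA

print(df)
print(df['Fee'].dtype)
```

The column still supports numeric operations such as df['Fee'].sum(), which skips the missing entries by default.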