I have data from a CSV file read into a Pandas DataFrame (all cells have string type; NaNs have already been replaced with "") with some duplicate values that I need to get rid of.
A sample input CSV:
Col1,Col2,Col3
A,rrrrr,fff
A,,ffff
B,rrr,fffff
C,,ffffff
D,rrrrrrr,ffff
C,rrrr,fffff
E,rrrrr,fff
C,,kkkkk
I need to remove those rows that have a duplicate value in Col1 and where the Col2 value is "". So the desired result should look like this:
Col1,Col2,Col3
A,rrrrr,fff
B,rrr,fffff
D,rrrrrrr,ffff
C,rrrr,fffff
E,rrrrr,fff
I use the following line of code to de-duplicate:
my_df = my_df.loc[(my_df["Col2"] != "") | ~my_df["Col1"].duplicated()]
This removes some, but not all, of the desired rows with duplicates in Col1. If such a "duplicate row" comes before the row that should be kept (the one with a non-empty Col2), it is not removed, and my code gives a result like this:
Col1,Col2,Col3
A,rrrrr,fff
B,rrr,fffff
C,,ffffff
D,rrrrrrr,ffff
C,rrrr,fffff
E,rrrrr,fff
As you see, the duplicate for A is removed correctly, while for C the duplicate that occurs before the row that should be kept stays.
Any advice on how to fix this? I might be missing something right under my nose here.
CodePudding user response:
You can compute the count of each value in Col1
and then combine that computed count
with a check for the empty string
to build a mask of the rows to drop. (Note: since the NaNs were already replaced with "", you must compare against "" rather than use pandas.Series.isna(), which would match nothing here.)
# count how many times each Col1 value occurs
df['cnt_col1'] = df.groupby('Col1')['Col1'].transform('count')
# rows to drop: empty Col2 and a duplicated Col1 value
m = (df["Col2"] == "") & (df["cnt_col1"] > 1)
df_new = df[~m].drop('cnt_col1', axis=1)
print(df_new)
Or without groupby:
cnt = df['Col1'].value_counts()
cnt_col1 = df['Col1'].map(cnt)
m = (df["Col2"] == "") & (cnt_col1 > 1)
print(df[~m])
Output:
Col1 Col2 Col3
0 A rrrrr fff
2 B rrr fffff
4 D rrrrrrr ffff
5 C rrrr fffff
6 E rrrrr fff
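For completeness, your original one-liner is nearly there: the default duplicated(keep='first') only flags the second and later occurrences, which is why order matters. Passing keep=False flags every occurrence of a repeated Col1 value, so the mask no longer depends on row order. A minimal runnable sketch using the sample data from the question:

```python
import pandas as pd

# sample data from the question; empty strings stand in for missing values
df = pd.DataFrame(
    {
        "Col1": ["A", "A", "B", "C", "D", "C", "E", "C"],
        "Col2": ["rrrrr", "", "rrr", "", "rrrrrrr", "rrrr", "rrrrr", ""],
        "Col3": ["fff", "ffff", "fffff", "ffffff", "ffff", "fffff", "fff", "kkkkk"],
    }
)

# keep a row if Col2 is non-empty, OR if its Col1 value is not duplicated at all;
# keep=False marks *every* occurrence of a duplicated value, so the row order
# no longer affects which duplicates get dropped
out = df.loc[(df["Col2"] != "") | ~df["Col1"].duplicated(keep=False)]
print(out)
```

Note that, like the groupby approach above, this drops all rows for a Col1 value whose every occurrence has an empty Col2 — which appears consistent with the stated intent.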