I have data from a CSV file read into a Pandas DataFrame (all cells have string type; NaNs have already been replaced with "") with some duplicate values that I need to get rid of.
A sample input CSV:
Col1,Col2,Col3
A,rrrrr,fff
A,,ffff
B,rrr,fffff
C,,ffffff
D,rrrrrrr,ffff
C,rrrr,fffff
E,rrrrr,fff
C,,kkkkk
I need to remove those rows that have a duplicate value in Col1 and where the Col2 value is "". So the desired result should look like this:
Col1,Col2,Col3
A,rrrrr,fff
B,rrr,fffff
D,rrrrrrr,ffff
C,rrrr,fffff
E,rrrrr,fff
I use the following line of code to de-duplicate:
my_df = my_df.loc[(my_df["Col2"] != "") | ~my_df["Col1"].duplicated()]
This removes some, but not all, of the desired rows with duplicates in Col1. If such a "duplicate row" comes before the row that should be kept (the one with a non-empty Col2), it is not removed, and my code gives a result like this:
Col1,Col2,Col3
A,rrrrr,fff
B,rrr,fffff
C,,ffffff
D,rrrrrrr,ffff
C,rrrr,fffff
E,rrrrr,fff
As you see, the duplicate for A is removed correctly, while for C the duplicate that occurs before the row that should be kept stays.
Any advice on how to fix this? I might be missing something right under my nose here.
CodePudding user response:
You can compute the count of each value in Col1
and then combine that computed count
with a check for the empty string
to build a mask of the rows to drop. (Note: since the NaNs were already replaced with "", you must compare against "" rather than use pandas.Series.isna(), which would match nothing here.)
# count how many times each Col1 value occurs
df['cnt_col1'] = df.groupby('Col1')['Col1'].transform('count')
# rows to drop: empty Col2 and a duplicated Col1 value
m = (df["Col2"] == "") & (df["cnt_col1"] > 1)
df_new = df[~m].drop('cnt_col1', axis=1)
print(df_new)
Or without groupby:
cnt = df['Col1'].value_counts()
cnt_col1 = df['Col1'].map(cnt)
m = (df["Col2"] == "") & (cnt_col1 > 1)
print(df[~m])
Output:
Col1 Col2 Col3
0 A rrrrr fff
2 B rrr fffff
4 D rrrrrrr ffff
5 C rrrr fffff
6 E rrrrr fff
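For completeness, your original one-liner is nearly there: the default duplicated(keep='first') only flags the second and later occurrences, which is why order matters. Passing keep=False flags every occurrence of a repeated Col1 value, so the mask no longer depends on row order. A minimal runnable sketch using the sample data from the question:

```python
import pandas as pd

# sample data from the question; empty strings stand in for missing values
df = pd.DataFrame(
    {
        "Col1": ["A", "A", "B", "C", "D", "C", "E", "C"],
        "Col2": ["rrrrr", "", "rrr", "", "rrrrrrr", "rrrr", "rrrrr", ""],
        "Col3": ["fff", "ffff", "fffff", "ffffff", "ffff", "fffff", "fff", "kkkkk"],
    }
)

# keep a row if Col2 is non-empty, OR if its Col1 value is not duplicated at all;
# keep=False marks *every* occurrence of a duplicated value, so the row order
# no longer affects which duplicates get dropped
out = df.loc[(df["Col2"] != "") | ~df["Col1"].duplicated(keep=False)]
print(out)
```

Note that, like the groupby approach above, this drops all rows for a Col1 value whose every occurrence has an empty Col2 — which appears consistent with the stated intent.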