Suppose I have a pandas df like the following. For each item, such as python, it has the top 3 similar items in top1, top2, top3 and their similarity scores in score1, score2, score3. I want to set the topN item to an empty string if its similarity score is below 0.8, and if all scores in a row are below 0.8, drop that row entirely.
         0       top1   top2  top3  score1  score2  score3
0   python       perl    php  java     0.9     0.7     0.4
1     coke  diel_coke  pepsi  taco    0.85     0.9    0.23
2    apple        car  house  hill     0.3     0.1    0.05
So the dataframe will become the following after the drop:
         0       top1   top2  top3  score1  score2  score3
0   python       perl                   0.9
1     coke  diel_coke  pepsi           0.85     0.9
Iterating over the df row by row seems very slow when the df is big. How can I achieve this without iterating row by row?
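Here is a minimal snippet to reproduce the df above (assuming the first column is literally named '0', as printed):

import pandas as pd

df = pd.DataFrame({
    '0': ['python', 'coke', 'apple'],
    'top1': ['perl', 'diel_coke', 'car'],
    'top2': ['php', 'pepsi', 'house'],
    'top3': ['java', 'taco', 'hill'],
    'score1': [0.9, 0.85, 0.3],
    'score2': [0.7, 0.9, 0.1],
    'score3': [0.4, 0.23, 0.05],
})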
CodePudding user response:
Filter the score columns and compare them with 0.8, then set the matching values to missing, and finally remove rows where all of those values are missing:
#explicit column lists (both in the same order)
L1 = ['top1','top2','top3']
L2 = ['score1','score2','score3']
#alternative if the column names follow the same ordering
L1 = df.filter(like='top').columns.tolist()
L2 = df.filter(like='score').columns.tolist()
#repeat L2 (L2 + L2) so the mask lines up with the L1 + L2 columns
m = df[L2 + L2].lt(0.8).to_numpy()
df[L1 + L2] = df[L1 + L2].mask(m)
#remove rows that are missing in all L1 + L2 columns
df1 = df.dropna(how='all', subset=L1 + L2)
print (df1)
         0       top1   top2  top3  score1  score2  score3
0   python       perl    NaN   NaN    0.90     NaN     NaN
1     coke  diel_coke  pepsi   NaN    0.85     0.9     NaN
df2 = df1.fillna('')
print (df2)
         0       top1   top2  top3  score1  score2  score3
0   python       perl                  0.90
1     coke  diel_coke  pepsi           0.85     0.9
For an empty-string solution use:
L1 = ['top1','top2','top3']
L2 = ['score1','score2','score3']
m = df[L2 + L2].lt(0.8).to_numpy()
df[L1 + L2] = df[L1 + L2].mask(m, '')
df2 = df[df[L1 + L2].ne('').any(axis=1)]
print (df2)
         0       top1   top2  top3  score1  score2  score3
0   python       perl                   0.9
1     coke  diel_coke  pepsi           0.85     0.9
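One caveat for the empty-string variants above: writing '' into the score columns (via fillna('') or mask(m, '')) mixes strings into float columns, so they end up with object dtype. If you need the scores to stay numeric, a possible variation (just a sketch, starting again from the original df and reusing the L1/L2 lists) blanks only the top columns and leaves low scores as NaN:

m = df[L2].lt(0.8).to_numpy()
df[L1] = df[L1].mask(m, '')   #empty strings only in the text columns
df[L2] = df[L2].mask(m)       #scores stay float, low ones become NaN
df1 = df[~m.all(axis=1)]      #drop rows where every score is below 0.8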
CodePudding user response:
# Collect the relevant values:
tops = df.filter(like='top')
scores = df.filter(like='score')
# Create the mask:
low_scores = scores.lt(0.8).to_numpy()
# Apply the mask to each set of values:
for value in (tops, scores):
    df[value.columns] = value.mask(low_scores, '')
# Drop rows where all of the scores were low:
df = df[~low_scores.all(axis=1)]
print(df)
Output:
         0       top1   top2  top3  score1  score2  score3
0   python       perl                   0.9
1     coke  diel_coke  pepsi           0.85     0.9
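If you want to check that this vectorized approach really beats row-by-row iteration on a larger frame, a rough benchmark sketch could look like the following (the frame size, the helper names vectorized/looped and the iterrows baseline are all illustrative, not part of the answers above):

import numpy as np
import pandas as pd
from timeit import timeit

# Build a larger random frame with the same column layout (size is arbitrary).
n = 20_000
rng = np.random.default_rng(0)
big = pd.DataFrame({
    '0': [f'item{i}' for i in range(n)],
    **{f'top{k}': [f'sim{k}_{i}' for i in range(n)] for k in (1, 2, 3)},
    **{f'score{k}': rng.random(n) for k in (1, 2, 3)},
})

def vectorized(df):
    # same idea as above: mask low scores column-wise, then drop all-low rows
    tops = df.filter(like='top')
    scores = df.filter(like='score')
    low = scores.lt(0.8).to_numpy()
    out = df.copy()
    out[tops.columns] = tops.mask(low, '')
    out[scores.columns] = scores.mask(low, '')
    return out[~low.all(axis=1)]

def looped(df):
    # the slow row-by-row approach the question wants to avoid
    rows = []
    for _, row in df.iterrows():
        row = row.copy()
        keep = False
        for k in (1, 2, 3):
            if row[f'score{k}'] < 0.8:
                row[f'top{k}'] = ''
                row[f'score{k}'] = ''
            else:
                keep = True
        if keep:
            rows.append(row)
    return pd.DataFrame(rows)

print('vectorized:', timeit(lambda: vectorized(big), number=1))
print('iterrows  :', timeit(lambda: looped(big), number=1))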