Suppose I have a pandas df like the following. For each item, such as python, it has the top 3 similar items in top1, top2, top3 and their similarity scores in score1, score2, score3. I want to set the topN item to an empty string if its similarity score is below 0.8, and if all scores in a row are below 0.8, drop that row entirely.
         0       top1   top2  top3  score1  score2  score3
0   python       perl    php  java     0.9     0.7     0.4
1     coke  diel_coke  pepsi  taco    0.85     0.9    0.23
2    apple        car  house  hill     0.3     0.1    0.05
So the dataframe will become the following after the drop:
         0       top1   top2  top3  score1  score2  score3
0   python       perl                   0.9
1     coke  diel_coke  pepsi           0.85     0.9
Iterating over the df row by row seems very slow when the df is big. How can I achieve this without iterating row by row?
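Here is a minimal snippet to reproduce the df above (assuming the first column is literally named '0', as printed):

import pandas as pd

df = pd.DataFrame({
    '0': ['python', 'coke', 'apple'],
    'top1': ['perl', 'diel_coke', 'car'],
    'top2': ['php', 'pepsi', 'house'],
    'top3': ['java', 'taco', 'hill'],
    'score1': [0.9, 0.85, 0.3],
    'score2': [0.7, 0.9, 0.1],
    'score3': [0.4, 0.23, 0.05],
})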
CodePudding user response:
Filter the score columns and compare them with 0.8, then set the matching values to missing, and finally remove rows where all of those values are missing:
#explicit column lists (both in the same order)
L1 = ['top1','top2','top3']
L2 = ['score1','score2','score3']
#alternative if the column names follow the same ordering
L1 = df.filter(like='top').columns.tolist()
L2 = df.filter(like='score').columns.tolist()
#repeat L2 (L2 + L2) so the mask lines up with the L1 + L2 columns
m = df[L2 + L2].lt(0.8).to_numpy()
df[L1 + L2] = df[L1 + L2].mask(m)
#remove rows that are missing in all L1 + L2 columns
df1 = df.dropna(how='all', subset=L1 + L2)
print (df1)
         0       top1   top2  top3  score1  score2  score3
0   python       perl    NaN   NaN    0.90     NaN     NaN
1     coke  diel_coke  pepsi   NaN    0.85     0.9     NaN
df2 = df1.fillna('')
print (df2)
         0       top1   top2  top3  score1  score2  score3
0   python       perl                  0.90
1     coke  diel_coke  pepsi           0.85     0.9
For an empty-string solution use:
L1 = ['top1','top2','top3']
L2 = ['score1','score2','score3']
m = df[L2 + L2].lt(0.8).to_numpy()
df[L1 + L2] = df[L1 + L2].mask(m, '')
df2 = df[df[L1 + L2].ne('').any(axis=1)]
print (df2)
         0       top1   top2  top3  score1  score2  score3
0   python       perl                   0.9
1     coke  diel_coke  pepsi           0.85     0.9
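One caveat for the empty-string variants above: writing '' into the score columns (via fillna('') or mask(m, '')) mixes strings into float columns, so they end up with object dtype. If you need the scores to stay numeric, a possible variation (just a sketch, starting again from the original df and reusing the L1/L2 lists) blanks only the top columns and leaves low scores as NaN:

m = df[L2].lt(0.8).to_numpy()
df[L1] = df[L1].mask(m, '')   #empty strings only in the text columns
df[L2] = df[L2].mask(m)       #scores stay float, low ones become NaN
df1 = df[~m.all(axis=1)]      #drop rows where every score is below 0.8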
CodePudding user response:
# Collect the relevant values:
tops = df.filter(like='top')
scores = df.filter(like='score')
# Create the mask:
low_scores = scores.lt(0.8).to_numpy()
# Apply the mask to each set of values:
for value in (tops, scores):
    df[value.columns] = value.mask(low_scores, '')
# Drop rows where all of the scores were low:
df = df[~low_scores.all(axis=1)]
print(df)
Output:
         0       top1   top2  top3  score1  score2  score3
0   python       perl                   0.9
1     coke  diel_coke  pepsi           0.85     0.9
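If you want to check that this vectorized approach really beats row-by-row iteration on a larger frame, a rough benchmark sketch could look like the following (the frame size, the helper names vectorized/looped and the iterrows baseline are all illustrative, not part of the answers above):

import numpy as np
import pandas as pd
from timeit import timeit

# Build a larger random frame with the same column layout (size is arbitrary).
n = 20_000
rng = np.random.default_rng(0)
big = pd.DataFrame({
    '0': [f'item{i}' for i in range(n)],
    **{f'top{k}': [f'sim{k}_{i}' for i in range(n)] for k in (1, 2, 3)},
    **{f'score{k}': rng.random(n) for k in (1, 2, 3)},
})

def vectorized(df):
    # same idea as above: mask low scores column-wise, then drop all-low rows
    tops = df.filter(like='top')
    scores = df.filter(like='score')
    low = scores.lt(0.8).to_numpy()
    out = df.copy()
    out[tops.columns] = tops.mask(low, '')
    out[scores.columns] = scores.mask(low, '')
    return out[~low.all(axis=1)]

def looped(df):
    # the slow row-by-row approach the question wants to avoid
    rows = []
    for _, row in df.iterrows():
        row = row.copy()
        keep = False
        for k in (1, 2, 3):
            if row[f'score{k}'] < 0.8:
                row[f'top{k}'] = ''
                row[f'score{k}'] = ''
            else:
                keep = True
        if keep:
            rows.append(row)
    return pd.DataFrame(rows)

print('vectorized:', timeit(lambda: vectorized(big), number=1))
print('iterrows  :', timeit(lambda: looped(big), number=1))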