How to check each column for a value greater than a threshold and accept if at least 3 out of 6 columns have that value

I am trying to build an article similarity checker by comparing 6 baseline articles with a list of articles that I obtained from an API. I have used cosine similarity to compare each article, one by one, with the 6 articles that I am using as a baseline.

My dataframe now looks like this:

id	Article	cosinesin1	cosinesin2	cosinesin3	cosinesin4	cosinesin5	cosinesin6	Similar
id1	[Article1]	0.2	0.5	0.6	0.8	0.7	0.8	True
id2	[Article2]	0.1	0.2	0.03	0.8	0.2	0.45	False

So I want to add a Similar column to my dataframe that checks the values of cosinesin1 through cosinesin6 for each row and returns True if at least 3 out of the 6 are greater than 0.5, and False otherwise.

Is there any way to do this in Python?

Thanks

CodePudding user response:

In Python, you can treat True and False as integers, 1 and 0, respectively.
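As a quick illustration of that point, here is a minimal sketch in plain Python. The names flags and count_true are made up for this example and are not part of the question:

```python
# bool is a subclass of int, so True counts as 1 and False as 0.
flags = [0.2 > 0.5, 0.6 > 0.5, 0.8 > 0.5]  # [False, True, True]

# Summing the Booleans counts how many comparisons were True.
count_true = sum(flags)
print(count_true)
```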

So if you compare all the similarity scores to 0.5, you get a Boolean DataFrame. Summing it across the columns (axis=1) gives, for each row, the number of comparisons that resulted in True. Comparing those counts to 3 yields the column you want:

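# Names of the six similarity columns: cosinesin1 ... cosinesin6
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
# Count the values above 0.5 in each row, then require at least 3 of them
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3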
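To try this end to end, here is a self-contained sketch that rebuilds the two sample rows from the question and applies the same idea. The intermediate n_above column is a hypothetical helper added only to make the counts visible; it is not required for the result.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id": ["id1", "id2"],
    "Article": ["[Article1]", "[Article2]"],
    "cosinesin1": [0.2, 0.1],
    "cosinesin2": [0.5, 0.2],
    "cosinesin3": [0.6, 0.03],
    "cosinesin4": [0.8, 0.8],
    "cosinesin5": [0.7, 0.2],
    "cosinesin6": [0.8, 0.45],
})

cos_cols = [f"cosinesin{i}" for i in range(1, 7)]

# Boolean DataFrame: True where a similarity score is strictly above 0.5.
above = df[cos_cols] > 0.5

# Hypothetical helper column: number of scores above 0.5 in each row.
df["n_above"] = above.sum(axis=1)

# A row counts as similar when at least 3 of the 6 scores pass.
df["Similar"] = df["n_above"] >= 3

print(df[["id", "n_above", "Similar"]])
```

Note that the comparison is strict, so a score of exactly 0.5 (cosinesin2 for id1) does not count. Use >= 0.5 instead if it should.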