I am trying to make an article similarity checker by comparing 6 baseline articles with a list of articles that I obtained from an API. I have used cosine similarity to compare each article, one by one, with the 6 articles that I am using as a baseline.
My dataframe now looks like this:
id | Article | cosinesin1 | cosinesin2 | cosinesin3 | cosinesin4 | cosinesin5 | cosinesin6 | Similar |
---|---|---|---|---|---|---|---|---|
id1 | [Article1] | 0.2 | 0.5 | 0.6 | 0.8 | 0.7 | 0.8 | True |
id2 | [Article2] | 0.1 | 0.2 | 0.03 | 0.8 | 0.2 | 0.45 | False |
So I want to add a `Similar` column to my dataframe that checks the values of cosinesin1 to cosinesin6 for each row and returns True if at least 3 of the 6 values are greater than 0.5, and False otherwise.
Is there any way to do this in Python?
Thanks
CodePudding user response:
In Python, you can treat `True` and `False` as integers, 1 and 0, respectively.
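As a quick illustration of that point, here is a minimal sketch in plain Python; the names `flags` and `count` are made up for the example and are not part of the question's data.

```python
# bool is a subclass of int, so True counts as 1 and False as 0.
flags = [True, False, True, True]

# Summing the flags counts how many of them are True.
count = sum(flags)

print(isinstance(True, int))  # bool values are also ints
print(count)
```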
So if you compare all the similarity metrics to 0.5, you can sum over the resulting Boolean DataFrame along the columns to get the number of comparisons that resulted in `True` for each row. Comparing those numbers to 3 yields the column you want:
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
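To see the intermediate steps, here is a self-contained sketch that rebuilds the two sample rows from the question and applies the same approach. The helper column `n_above` and the variable `above` are names introduced only for illustration; the column names are assumed to match the table in the question exactly.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "id": ["id1", "id2"],
    "Article": ["[Article1]", "[Article2]"],
    "cosinesin1": [0.2, 0.1],
    "cosinesin2": [0.5, 0.2],
    "cosinesin3": [0.6, 0.03],
    "cosinesin4": [0.8, 0.8],
    "cosinesin5": [0.7, 0.2],
    "cosinesin6": [0.8, 0.45],
})

cos_cols = [f"cosinesin{i}" for i in range(1, 7)]

# Boolean DataFrame: True where a similarity is strictly greater than 0.5.
above = df[cos_cols] > 0.5

# Row-wise sum counts the True values (True is treated as 1).
df["n_above"] = above.sum(axis=1)

# A row is "Similar" when at least 3 of the 6 scores pass the threshold.
df["Similar"] = df["n_above"] >= 3

print(df[["id", "n_above", "Similar"]])
```

Note that `>` is a strict comparison, so a score of exactly 0.5 (such as cosinesin2 for id1) does not count toward the total; use `>=` instead if such values should count.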