How to find the cosine similarity between two DataFrames in pandas?

I have 2 dataframes:

df1:
font_label | font_size | len_words | letter_per_words | text_area_ratio | image_area | Effectiveness
         1 |        11 |         7 |         9.714286 |        0.046231 |     310200 |          20.2
         2 |      10.5 |         8 |               11 |          0.0399 |     310150 |          19.2
         1 |      11.5 |         9 |               10 |           0.040 |     310100 |          21.2

df2:

font_label | font_size | len_words | letter_per_words | text_area_ratio | image_area | Effectiveness
         1 |        12 |         8 |             10.5 |          0.0399 |     310100 |            21

I am trying to write a function that takes df2 and returns the row of df1 that is the closest match to it by cosine similarity. The returned row must also have an Effectiveness value greater than the Effectiveness value in df2.

I tried to do the following:

from sklearn.metrics.pairwise import cosine_similarity

X = cosine_similarity(df1)
y = cosine_similarity(df2)

After this, I have no idea how to proceed to get the output.

Expected Output:

When df2 is passed to the function my expected output is:

font_label | font_size | len_words | letter_per_words | text_area_ratio | image_area | Effectiveness
         1 |      11.5 |         9 |               10 |           0.040 |     310100 |          21.2

CodePudding user response:

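One way to do that is with a function that computes the cosine similarity between every row of df1 and the row in df2, then returns the best-matching row of df1: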

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_closest_row(df1, df2):

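    # Cosine similarity between each row of df1 and df2, with the
    # Effectiveness column excluded from the features
    cos_sim = cosine_similarity(df1.drop(columns=['Effectiveness']), df2.drop(columns=['Effectiveness']))

    # Position of the most similar row (np.argmax flattens the array,
    # so this assumes df2 contains a single row)
    index = np.argmax(cos_sim)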

    # Get the row from df1 with the maximum cosine similarity
    row = df1.iloc[index]

    # Return the row
    return row

Then, if one applies the function to df1 and df2, one gets the following:

df_new = get_closest_row(df1, df2)

[Out]:

font_label               1.00
font_size               11.50
len_words                9.00
letter_per_words        10.00
text_area_ratio          0.04
image_area          310100.00
Effectiveness           21.20
Name: 2, dtype: float64
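
Note that get_closest_row does not check the Effectiveness condition from the question; with this sample data it happens to return a row that qualifies. Below is a minimal sketch of a variant that filters df1 first. It assumes df2 holds exactly one row. The name get_closest_better_row and its target parameter are hypothetical, and df1 and df2 are rebuilt here from the tables in the question.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample data rebuilt from the tables in the question
df1 = pd.DataFrame({
    "font_label": [1, 2, 1],
    "font_size": [11.0, 10.5, 11.5],
    "len_words": [7, 8, 9],
    "letter_per_words": [9.714286, 11.0, 10.0],
    "text_area_ratio": [0.046231, 0.0399, 0.040],
    "image_area": [310200, 310150, 310100],
    "Effectiveness": [20.2, 19.2, 21.2],
})
df2 = pd.DataFrame({
    "font_label": [1],
    "font_size": [12.0],
    "len_words": [8],
    "letter_per_words": [10.5],
    "text_area_ratio": [0.0399],
    "image_area": [310100],
    "Effectiveness": [21.0],
})


def get_closest_better_row(df1, df2, target="Effectiveness"):
    """Hypothetical helper: most similar df1 row with a higher target value.

    Assumes df2 contains exactly one row.
    """
    threshold = df2[target].iloc[0]

    # Keep only rows whose target value is strictly greater than df2's
    candidates = df1[df1[target] > threshold]
    if candidates.empty:
        return None  # no row satisfies the constraint

    # Similarity of each remaining row to df2, without the target column
    sims = cosine_similarity(
        candidates.drop(columns=[target]),
        df2.drop(columns=[target]),
    ).ravel()

    # argmax is a position within candidates, so index candidates, not df1
    return candidates.iloc[int(np.argmax(sims))]


best = get_closest_better_row(df1, df2)
print(best)
```

Returning None when no row qualifies is a choice made for this sketch; raising an error or returning an empty result would work just as well. Like get_closest_row, this variant returns a Series.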

However, since the desired output is a DataFrame rather than a Series, one has to convert the result with pandas.DataFrame. To end up with the desired layout, one also has to transpose it with .T:

df_new = pd.DataFrame(df_new).T

[Out]:

   font_label  font_size  len_words  ...  text_area_ratio  image_area  Effectiveness
2         1.0       11.5        9.0  ...             0.04    310100.0           21.2
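
As a possible alternative to converting and transposing, passing a list of positions to iloc returns a one-row DataFrame directly. The sketch below hard-codes the position 2 for illustration (in practice it would come from np.argmax) and rebuilds df1 from the question's table. One difference worth noting: this form keeps the original column dtypes, whereas the Series route turns every value into a float, as seen in the output above.

```python
import pandas as pd

# Sample data rebuilt from the df1 table in the question
df1 = pd.DataFrame({
    "font_label": [1, 2, 1],
    "font_size": [11.0, 10.5, 11.5],
    "len_words": [7, 8, 9],
    "letter_per_words": [9.714286, 11.0, 10.0],
    "text_area_ratio": [0.046231, 0.0399, 0.040],
    "image_area": [310200, 310150, 310100],
    "Effectiveness": [20.2, 19.2, 21.2],
})

# Hard-coded for illustration; normally the result of np.argmax
position = 2

# A list of positions makes iloc return a DataFrame instead of a Series
row_df = df1.iloc[[position]]

print(type(row_df).__name__)
print(row_df.dtypes)
```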

A one-liner would be as follows:

df_new = pd.DataFrame(df1.iloc[np.argmax(cosine_similarity(df1.drop(columns=['Effectiveness']), df2.drop(columns=['Effectiveness'])))]).T

[Out]:

   font_label  font_size  len_words  ...  text_area_ratio  image_area  Effectiveness
2         1.0       11.5        9.0  ...             0.04    310100.0           21.2
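
One caveat with this data: image_area is several orders of magnitude larger than the other columns, so the raw cosine similarities are all extremely close to 1 and the ranking rests on tiny differences. The sketch below compares the raw scores with scores computed after standardizing each column. Using df1's mean and standard deviation for both frames is an assumption of this sketch, not part of the answer above, and it requires every feature column of df1 to be non-constant. With only three rows the statistics are rough, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample data rebuilt from the tables in the question
df1 = pd.DataFrame({
    "font_label": [1, 2, 1],
    "font_size": [11.0, 10.5, 11.5],
    "len_words": [7, 8, 9],
    "letter_per_words": [9.714286, 11.0, 10.0],
    "text_area_ratio": [0.046231, 0.0399, 0.040],
    "image_area": [310200, 310150, 310100],
    "Effectiveness": [20.2, 19.2, 21.2],
})
df2 = pd.DataFrame({
    "font_label": [1],
    "font_size": [12.0],
    "len_words": [8],
    "letter_per_words": [10.5],
    "text_area_ratio": [0.0399],
    "image_area": [310100],
    "Effectiveness": [21.0],
})

features1 = df1.drop(columns=["Effectiveness"])
features2 = df2.drop(columns=["Effectiveness"])

# Raw scores: dominated by image_area, so all values are nearly 1
raw_sims = cosine_similarity(features1, features2).ravel()

# Standardize both frames with df1's column statistics
# (assumption of this sketch; fails if a df1 column is constant)
mean = features1.mean()
std = features1.std(ddof=0)
scaled1 = (features1 - mean) / std
scaled2 = (features2 - mean) / std

# Scores after scaling: each column now contributes on a comparable scale
scaled_sims = cosine_similarity(scaled1, scaled2).ravel()

print(raw_sims)
print(scaled_sims)
```

For this sample both versions pick the same row of df1, but on larger data the two rankings can differ, so it is worth deciding which notion of "closest" is intended.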