I have 2 dataframes:
df1:
font_label | font_size | len_words | letter_per_words | text_area_ratio | image_area | Effectiveness
1          | 11        | 7         | 9.714286         | 0.046231        | 310200     | 20.2
2          | 10.5      | 8         | 11               | 0.0399          | 310150     | 19.2
1          | 11.5      | 9         | 10               | 0.040           | 310100     | 21.2
df2:
font_label | font_size | len_words | letter_per_words | text_area_ratio | image_area | Effectiveness
1          | 12        | 8         | 10.5             | 0.0399          | 310100     | 21
I am trying to write a function that takes df2 and returns the row of df1 that is the closest match by cosine similarity. The returned row (i.e. the row selected from df1) must also have an Effectiveness value greater than the Effectiveness value in df2.
I tried to do the following:
from sklearn.metrics.pairwise import cosine_similarity
X = cosine_similarity(df1)
y = cosine_similarity(df2)
After this I have no idea how to proceed to get the output.
Expected Output:
When df2 is passed to the function my expected output is:
font_label | font_size | len_words | letter_per_words | text_area_ratio | image_area | Effectiveness
1          | 11.5      | 9         | 10               | 0.040           | 310100     | 21.2
CodePudding user response:
One way to do that is as follows:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def get_closest_row(df1, df2):
    # Get the cosine similarity
    cos_sim = cosine_similarity(df1.drop(columns=['Effectiveness']), df2.drop(columns=['Effectiveness']))
    # Get the index of the maximum value in the cosine similarity
    index = np.argmax(cos_sim)
    # Get the row from df1 with the maximum cosine similarity
    row = df1.iloc[index]
    # Return the row
    return row
Then, applying the function to df1 and df2 gives the following:
df_new = get_closest_row(df1, df2)
[Out]:
font_label 1.00
font_size 11.50
len_words 9.00
letter_per_words 10.00
text_area_ratio 0.04
image_area 310100.00
Effectiveness 21.20
Name: 2, dtype: float64
However, since a dataframe is wanted, the returned Series has to be converted with pandas.DataFrame. To end up with the desired layout (one row, one column per feature), it also has to be transposed, hence the .T:
df_new = pd.DataFrame(df_new).T
[Out]:
font_label font_size len_words ... text_area_ratio image_area Effectiveness
2 1.0 11.5 9.0 ... 0.04 310100.0 21.2
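As a possible alternative to the conversion and transpose above, indexing with a list of positions (df1.iloc[[index]]) returns a one-row dataframe directly and keeps each column's original dtype, so integer columns such as font_label are not upcast to float. The sketch below rebuilds df1 and df2 from the values in the question; the variable names closest and feature_cols are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Data copied from the question
df1 = pd.DataFrame({
    "font_label": [1, 2, 1],
    "font_size": [11.0, 10.5, 11.5],
    "len_words": [7, 8, 9],
    "letter_per_words": [9.714286, 11.0, 10.0],
    "text_area_ratio": [0.046231, 0.0399, 0.040],
    "image_area": [310200, 310150, 310100],
    "Effectiveness": [20.2, 19.2, 21.2],
})
df2 = pd.DataFrame({
    "font_label": [1],
    "font_size": [12.0],
    "len_words": [8],
    "letter_per_words": [10.5],
    "text_area_ratio": [0.0399],
    "image_area": [310100],
    "Effectiveness": [21.0],
})

# Compare on the feature columns only
feature_cols = [c for c in df1.columns if c != "Effectiveness"]
cos_sim = cosine_similarity(df1[feature_cols], df2[feature_cols])

# Position of the most similar row of df1
index = int(np.argmax(cos_sim))

# A list inside iloc yields a DataFrame (not a Series), so no transpose is needed
closest = df1.iloc[[index]]
print(closest)
```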
A one-liner would be as follows:
df_new = pd.DataFrame(df1.iloc[np.argmax(cosine_similarity(df1.drop(columns=['Effectiveness']), df2.drop(columns=['Effectiveness'])))]).T
[Out]:
font_label font_size len_words ... text_area_ratio image_area Effectiveness
2 1.0 11.5 9.0 ... 0.04 310100.0 21.2
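Note that the function above only picks the most similar row; it does not check the Effectiveness condition from the question. It happens to return the expected row here because the most similar row also has the higher Effectiveness. A minimal sketch that applies the condition explicitly could filter df1 first and then rank the remaining rows. The function name get_closest_better_row and the target parameter are hypothetical, and the sketch assumes df2 always has exactly one row.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def get_closest_better_row(df1, df2, target="Effectiveness"):
    """Return the df1 row most similar to df2 among rows with a higher target value.

    Assumes df2 contains exactly one row. Returns an empty DataFrame
    when no row of df1 exceeds the target value of df2.
    """
    threshold = df2[target].iloc[0]
    # Keep only rows whose Effectiveness is strictly greater than df2's
    candidates = df1[df1[target] > threshold]
    if candidates.empty:
        return candidates
    # Cosine similarity on the feature columns only
    sims = cosine_similarity(
        candidates.drop(columns=[target]),
        df2.drop(columns=[target]),
    )
    # sims has one column because df2 has one row
    best = int(np.argmax(sims[:, 0]))
    return candidates.iloc[[best]]


# Data copied from the question
df1 = pd.DataFrame({
    "font_label": [1, 2, 1],
    "font_size": [11.0, 10.5, 11.5],
    "len_words": [7, 8, 9],
    "letter_per_words": [9.714286, 11.0, 10.0],
    "text_area_ratio": [0.046231, 0.0399, 0.040],
    "image_area": [310200, 310150, 310100],
    "Effectiveness": [20.2, 19.2, 21.2],
})
df2 = pd.DataFrame({
    "font_label": [1],
    "font_size": [12.0],
    "len_words": [8],
    "letter_per_words": [10.5],
    "text_area_ratio": [0.0399],
    "image_area": [310100],
    "Effectiveness": [21.0],
})

result = get_closest_better_row(df1, df2)
print(result)
```

Filtering before ranking means a very similar row with a lower Effectiveness can never be returned, and the empty-result case has to be handled by the caller.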