I have two pandas dataframes that I'd like to compare. One of them, df_inventory, is a large inventory list. I'd like to take each row from df and compare it to every row in df_inventory.
df.head()
item
0 paintbrush
1 mop #2
2 red bucket
3 o-light flashlight
df_inventory.head()
item_desc
0 broom
1 mop
2 bucket
3 flashlight
I'm trying a single apply() call, which raises a ValueError. Should I be using a nested apply() so that both dataframes are iterated row by row?
test = pd.DataFrame({'item':['example']})
test['similarity'] = test['item'].apply(lambda x: fuzz.ratio(x, df_inventory['item_desc']))
CodePudding user response:
You are comparing the rows of df with the rows of df_inventory using fuzz.ratio(). The single apply() fails because the lambda passes the whole df_inventory['item_desc'] column to fuzz.ratio(), which expects two strings. To score every pair, you need a second apply() over df_inventory nested inside the first one.
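To make the pairwise requirement concrete, here is a minimal sketch that scores one string against each inventory description in turn. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio() (an assumption, so that it runs without fuzzywuzzy installed), and the helper name ratio is hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (assumption): 0-100 similarity of two strings."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


inventory = ["broom", "mop", "bucket", "flashlight"]

# The scorer takes exactly two strings, so one query is compared
# against the inventory one description at a time.
scores = [ratio("mop #2", desc) for desc in inventory]
print(scores)
```

The result is one score per inventory row, which is what the inner apply() has to produce for every row of df.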
Here is an example of how you can use a nested apply() method to compare the rows in the two dataframes:
import pandas as pd
from fuzzywuzzy import fuzz
# Load the data into pandas dataframes
df = pd.DataFrame({'item': ['paintbrush', 'mop #2', 'red bucket', 'o-light flashlight']})
df_inventory = pd.DataFrame({'item_desc': ['broom', 'mop', 'bucket', 'flashlight']})
# Add a new column to the df dataframe called 'similarity'
# This column will hold the similarity scores for each row in df
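df['similarity'] = df['item'].apply(lambda x: df_inventory['item_desc'].apply(lambda y: fuzz.ratio(x, y)).tolist())
# Print the resulting dataframe
print(df)
This code compares each row in df with each row in df_inventory and stores the scores in the similarity column as one list per row, ordered like df_inventory (broom, mop, bucket, flashlight). The inner result is converted with tolist() so that each row yields a single value. Without it, the nested call returns a four-column frame that cannot be assigned to one column. Note that fuzz.ratio() returns 100 only for identical strings, so even the closest pairs here score below that:
item                 closest item_desc   score
mop #2               mop                 67
red bucket           bucket              75
o-light flashlight   flashlight          71
The remaining scores are low, and their exact values can differ depending on whether fuzzywuzzy is using python-Levenshtein or its pure-Python fallback.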
You can then use the similarity column to pick, for each item in df, the inventory row with the highest score.
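One way to go from scores to matches is to build a score matrix and take the row-wise maximum. The sketch below is only an illustration: it uses a difflib-based stand-in for fuzz.ratio() (an assumption, so it runs without fuzzywuzzy), and the names ratio, scores, best_match and best_score are hypothetical. Swapping fuzz.ratio in for ratio should work the same way, though some scores may change.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (assumption): 0-100 similarity of two strings."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


df = pd.DataFrame({"item": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]})
df_inventory = pd.DataFrame({"item_desc": ["broom", "mop", "bucket", "flashlight"]})

# Score matrix: one row per item in df, one column per inventory description.
scores = pd.DataFrame(
    [[ratio(item, desc) for desc in df_inventory["item_desc"]] for item in df["item"]],
    index=df.index,
    columns=df_inventory["item_desc"],
)

# For each item, keep the best-scoring inventory description and its score.
df["best_match"] = scores.idxmax(axis=1)
df["best_score"] = scores.max(axis=1)
print(df)
```

With this stand-in, 'mop #2' maps to 'mop' with a score of 67. An item such as 'paintbrush' has no real counterpart in the inventory, so in practice you would probably also apply a minimum score threshold before accepting a match.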