I have the large CSVs with following sample dataframes:
df1 =
Index Fruit Vegetable
0 Mango Spinach
1 Berry Carrot
2 Banana Cabbage
df2 =
Index Unit Price
0 Mango_123 30
1 234_Artichoke_CE 45
2 23_Banana 12
3 Berry___LE 10
4 Cabbage___12LW 25
5 Rice_ww_12 40
6 Spinach_KJ 34
7 234_Carrot_23 08
8 10000_Lentil 12
9 Pot________12 32
I would like to replace the names in df2 to replace the names in df1 to create the following dataframe:
df3=
Index Fruit Vegetable
0 Mango_123 Spinach_KJ
1 Berry___LE 234_Carrot_23
2 23_Banana Cabbage___12LW
What would be a generic way to do this? Thank you.
CodePudding user response:
You can use fuzzy matching with thefuzz.process.extractOne
, that will compute the closest match using Levenshtein Distance:
# pip install thefuzz
from thefuzz import process
cols = ['Fruit', 'Vegetable']
df1[cols] = df1[cols].applymap(lambda x: process.extractOne(x, df2['Unit'])[0])
output:
Index Fruit Vegetable
0 0 Mango_123 Spinach_KJ
1 1 Berry___LE 234_Carrot_23
2 2 23_Banana Cabbage___12LW