I have a university activity that makes the following dataframe available:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df1 = pd.DataFrame({'id': [0, 1, 2, 3],
                    'name_city': ['RIO DE JANEIRO', 'SAO PAULO',
                                  'ITU', 'CURITIBA'],
                    'city': ['RIO JANEIRO', 'SAOO PAULO',
                             'FLORIANOPOLIS', 'BELO HORIZONTE']})
print(df1)
id  name_city       city
0   RIO DE JANEIRO  RIO JANEIRO
1   SAO PAULO       SAOO PAULO
2   ITU             FLORIANOPOLIS
3   CURITIBA        BELO HORIZONTE
I need to assess how similar the city names in the two columns are. I'm using the fuzzywuzzy library, and I wrote the following code:
df1['value_fuzzy'] = 0  # create the column before filling it row by row
for i in range(len(df1)):
    df1.loc[i, 'value_fuzzy'] = fuzz.ratio(df1['name_city'].iloc[i], df1['city'].iloc[i])
This code works and produces the desired output:
print(df1)
id  name_city       city            value_fuzzy
0   RIO DE JANEIRO  RIO JANEIRO     88
1   SAO PAULO       SAOO PAULO      95
2   ITU             FLORIANOPOLIS   12
3   CURITIBA        BELO HORIZONTE  27
However, I would like to replace the for loop. I tried all(), which I found in an answer about the error "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()":
df1['value_fuzzy'] = fuzz.ratio(df1['name_city'].all(), df1['city'].all())
But the version with all() returns the wrong output, with the same value on every row:
id  name_city       city            value_fuzzy
0   RIO DE JANEIRO  RIO JANEIRO     27
1   SAO PAULO       SAOO PAULO      27
2   ITU             FLORIANOPOLIS   27
3   CURITIBA        BELO HORIZONTE  27
Is there another way to generate the desired output without using a for loop?
CodePudding user response:
You can't use fuzz.ratio this way directly: the function is not vectorized, so it expects two strings rather than two Series. You need to pass it to apply:
df1['value_fuzzy'] = df1.apply(lambda r: fuzz.ratio(r['name_city'], r['city']), axis=1)
NB. Whatever you do, this won't improve efficiency unless the function itself is rewritten to handle vectorized input; apply still calls it once per row.
output:
   id       name_city            city  value_fuzzy
0   0  RIO DE JANEIRO     RIO JANEIRO           88
1   1       SAO PAULO      SAOO PAULO           95
2   2             ITU   FLORIANOPOLIS           12
3   3        CURITIBA  BELO HORIZONTE           27
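Both points in this answer can be checked with a small sketch. It assumes `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so it runs without fuzzywuzzy installed; the helper names `ratio` and `counted_ratio` are hypothetical. It shows that assigning one scalar score to a column repeats it on every row, which is one plausible reading of the constant 27 in the question, and that `apply(..., axis=1)` still invokes the scorer row by row.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a: str, b: str) -> int:
    """Hypothetical stand-in for fuzz.ratio: a 0-100 similarity of two strings."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


df = pd.DataFrame({
    'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
    'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE'],
})

# A scorer that takes two strings returns ONE number; assigning that
# single number to a column repeats it on every row.
one_score = ratio(df['name_city'].iloc[-1], df['city'].iloc[-1])
df['broadcast'] = one_score

# Record every invocation to see how often apply() calls the scorer.
call_log = []


def counted_ratio(a: str, b: str) -> int:
    call_log.append((a, b))
    return ratio(a, b)


df['per_row'] = df.apply(lambda r: counted_ratio(r['name_city'], r['city']), axis=1)

print(df)
print('scorer calls:', len(call_log))  # at least one call per row
```

The call count grows with the number of rows, so apply is a tidier way to write the loop rather than a faster one.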
CodePudding user response:
df1['value_fuzzy'] = df1.apply(lambda row: fuzz.ratio(row['name_city'], row['city']), axis=1)
CodePudding user response:
Another way to do it, with a list comprehension:
df1['value_fuzzy'] = [fuzz.ratio(df1['name_city'][i],df1['city'][i]) for i in range(len(df1.index))]
Output:
df1
   id       name_city            city  value_fuzzy
0   0  RIO DE JANEIRO     RIO JANEIRO           88
1   1       SAO PAULO      SAOO PAULO           95
2   2             ITU   FLORIANOPOLIS           12
3   3        CURITIBA  BELO HORIZONTE           27
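The list comprehension above looks rows up by index label (`df1['name_city'][i]`), so it depends on the index being the default 0..n-1 and can break after rows are filtered or sorted. A possible variant pairs the two columns with `zip`, which walks them by position. The sketch below assumes `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` so that it runs without fuzzywuzzy; the helper `ratio` and the index labels 10 to 40 are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a: str, b: str) -> int:
    """Hypothetical stand-in for fuzz.ratio, built on the standard library."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Same data as the question, but with a non-default index on purpose:
# df['name_city'][0] would raise a KeyError here.
df = pd.DataFrame(
    {
        'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
        'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE'],
    },
    index=[10, 20, 30, 40],
)

# zip walks both columns in positional order, so index labels never matter.
df['value_fuzzy'] = [ratio(a, b) for a, b in zip(df['name_city'], df['city'])]

print(df)
```

With fuzzywuzzy installed, swapping `ratio` for `fuzz.ratio` should be the only change needed.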