How to replace a for loop with all() in a pandas DataFrame?

I have a university assignment that provides the following DataFrame:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df1 = pd.DataFrame({'id': [0,1,2,3],
                          'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 
                                        'ITU', 'CURITIBA'],
                          'city':['RIO JANEIRO', 'SAOO PAULO', 
                                  'FLORIANOPOLIS', 'BELO HORIZONTE']})

      print(df1)

            id  name_city             city
            0   RIO DE JANEIRO     RIO JANEIRO
            1   SAO PAULO          SAOO PAULO
            2   ITU                FLORIANOPOLIS
            3   CURITIBA           BELO HORIZONTE

I need to measure the similarity between the two city names in each row, so I'm using the fuzzywuzzy library and wrote the following code:

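      df1['value_fuzzy'] = 0  # create the column before filling it row by row
      for i in range(len(df1)):
          # score the two city names in row i (0 to 100)
          df1.loc[df1.index[i], 'value_fuzzy'] = fuzz.ratio(df1['name_city'].iloc[i], df1['city'].iloc[i])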

This code works perfectly; the output is as desired:

      print(df1)

           id   name_city           city               value_fuzzy
           0    RIO DE JANEIRO     RIO JANEIRO             88
           1    SAO PAULO          SAOO PAULO              95
           2    ITU                FLORIANOPOLIS           12
           3    CURITIBA           BELO HORIZONTE          27

However, I would like to get rid of the for loop. I tried all(), which I found in this example: "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"

        df1['value_fuzzy'] = fuzz.ratio(df1['name_city'].all(), df1['city'].all())

But the code with all() returns the wrong output, with the same score on every row:

       id   name_city        city          value_fuzzy
       0    RIO DE JANEIRO  RIO JANEIRO     27
       1    SAO PAULO       SAOO PAULO      27
       2    ITU             FLORIANOPOLIS   27
       3    CURITIBA        BELO HORIZONTE  27

Is there another way to generate the desired output without using a for loop?

CodePudding user response：

You can't use fuzz.ratio this way directly because the function is not vectorized: it compares two strings, not two columns. You need to call it row by row with apply:

df1['value_fuzzy'] = df1.apply(lambda r: fuzz.ratio(r['name_city'], r['city']), axis=1)

NB. Whatever you do, this won't improve efficiency unless the function is specifically rewritten to handle vectorized input; apply still calls it once per row.

output:

   id       name_city            city  value_fuzzy
0   0  RIO DE JANEIRO     RIO JANEIRO           88
1   1       SAO PAULO      SAOO PAULO           95
2   2             ITU   FLORIANOPOLIS           12
3   3        CURITIBA  BELO HORIZONTE           27
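
As for why the all() attempt fills every row with 27: that number is the score of the last row, which is consistent with fuzz.ratio receiving a single pair of strings and pandas broadcasting the resulting integer to the whole column. Here is a minimal sketch of that effect. It assumes all() collapsed each column to one value (what all() returns on a string column may depend on the pandas version), and the difflib fallback is only a stand-in so the sketch runs without fuzzywuzzy installed:

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (assumption): a difflib-based score rounded to an int, used
    # only so this sketch runs when fuzzywuzzy is not installed.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

df1 = pd.DataFrame({'id': [0, 1, 2, 3],
                    'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
                    'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE']})

# Two plain strings in, one integer out: only a single pair is compared.
single_score = ratio('CURITIBA', 'BELO HORIZONTE')

# Assigning a scalar to a column broadcasts it to every row.
df1['value_fuzzy'] = single_score
print(df1['value_fuzzy'].tolist())  # [27, 27, 27, 27]
```

So the problem is not the assignment itself but that the function only ever saw one pair of strings.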

CodePudding user response：

df1['value_fuzzy'] = df1.apply(lambda row: fuzz.ratio(row['name_city'], row['city']),
                               axis=1)
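
Keep in mind that apply with axis=1 still runs the Python function row by row, so it is tidier than the loop but not really faster. Here is a rough sketch that logs each call to make this visible. The names call_log and counted_ratio are made up for this illustration, and the difflib fallback is a stand-in so the sketch runs without fuzzywuzzy installed:

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (assumption): a difflib-based score rounded to an int, used
    # only so this sketch runs when fuzzywuzzy is not installed.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

df1 = pd.DataFrame({'id': [0, 1, 2, 3],
                    'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
                    'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE']})

call_log = []  # hypothetical helper: records every pair the scorer receives

def counted_ratio(a, b):
    """Score one pair of strings and remember that the call happened."""
    call_log.append((a, b))
    return ratio(a, b)

df1['value_fuzzy'] = df1.apply(
    lambda row: counted_ratio(row['name_city'], row['city']), axis=1)

# The scorer was invoked at least once for each row of the frame.
print(len(call_log) >= len(df1))  # True
```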

CodePudding user response：

Another way to do it, with a list comprehension:

df1['value_fuzzy'] = [fuzz.ratio(df1['name_city'][i], df1['city'][i]) for i in range(len(df1))]

Output:

df1

   id        name_city            city  value_fuzzy
0   0   RIO DE JANEIRO     RIO JANEIRO           88
1   1        SAO PAULO      SAOO PAULO           95
2   2              ITU   FLORIANOPOLIS           12
3   3         CURITIBA  BELO HORIZONTE           27
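
One caveat with indexing by i: df1['name_city'][i] looks values up by index label, so it only works while the index is the default 0..n-1. A variant that pairs the two columns with zip avoids that. The sketch below uses a made-up index (10 to 13) purely to illustrate the difference, and the difflib fallback is a stand-in so it runs without fuzzywuzzy installed:

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (assumption): a difflib-based score rounded to an int, used
    # only so this sketch runs when fuzzywuzzy is not installed.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Same data as in the question, but with a non-default index (illustrative).
df1 = pd.DataFrame({'id': [0, 1, 2, 3],
                    'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
                    'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE']},
                   index=[10, 11, 12, 13])

# zip walks both columns in order, so no index labels are involved.
df1['value_fuzzy'] = [ratio(a, b) for a, b in zip(df1['name_city'], df1['city'])]
print(df1['value_fuzzy'].tolist())  # [88, 95, 12, 27]
```

With this index, df1['name_city'][0] would raise a KeyError, because there is no row labelled 0.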