How to replace using for() with all() in a pandas dataframe?

Time:02-21

I have a university activity that makes the following dataframe available:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df1 = pd.DataFrame({'id': [0,1,2,3],
                          'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 
                                        'ITU', 'CURITIBA'],
                          'city':['RIO JANEIRO', 'SAOO PAULO', 
                                  'FLORIANOPOLIS', 'BELO HORIZONTE']})

      print(df1)

            id  name_city             city
            0   RIO DE JANEIRO     RIO JANEIRO
            1   SAO PAULO          SAOO PAULO
            2   ITU                FLORIANOPOLIS
            3   CURITIBA           BELO HORIZONTE

I need to measure how similar the city names in the two columns are. I'm using the fuzzywuzzy library and wrote the following code:

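      df1['value_fuzzy'] = 0  # create the column before filling it row by row
      for i in range(len(df1)):
          # compare the two city names in row i (default 0..n-1 index assumed)
          df1.loc[i, 'value_fuzzy'] = fuzz.ratio(df1.loc[i, 'name_city'], df1.loc[i, 'city'])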

This code works and produces the desired output:

      print(df1)

           id   name_city           city               value_fuzzy
           0    RIO DE JANEIRO     RIO JANEIRO             88
           1    SAO PAULO          SAOO PAULO              95
           2    ITU                FLORIANOPOLIS           12
           3    CURITIBA           BELO HORIZONTE          27

However, I would like to get rid of the for loop. I tried all(), which I found while looking up this error message: "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()".

        df1['value_fuzzy'] = fuzz.ratio(df1['name_city'].all(), df1['city'].all())

But this code with all() returns the wrong output, with the same value in every row:

       id   name_city        city          value_fuzzy
       0    RIO DE JANEIRO  RIO JANEIRO     27
       1    SAO PAULO       SAOO PAULO      27
       2    ITU             FLORIANOPOLIS   27
       3    CURITIBA        BELO HORIZONTE  27

Is there another way to generate the desired output without using for()?

CodePudding user response:

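You can't use fuzz.ratio this way directly because the function is not vectorized: it compares two strings, not two columns. all() collapses each column to a single value, so fuzz.ratio runs only once and that one result is broadcast to every row. To call it once per row, pass it to apply: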

df1['value_fuzzy'] = df1.apply(lambda r: fuzz.ratio(r['name_city'], r['city']), axis=1)

NB: apply still calls the function once per row. Unless the function itself is rewritten to handle vectorized input, this won't be any faster than the loop.

output:

   id       name_city            city  value_fuzzy
0   0  RIO DE JANEIRO     RIO JANEIRO           88
1   1       SAO PAULO      SAOO PAULO           95
2   2             ITU   FLORIANOPOLIS           12
3   3        CURITIBA  BELO HORIZONTE           27
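
To see why the all() attempt in the question puts the same number in every row, here is a minimal sketch. It assumes a stand-in scorer, `similarity`, built on the standard library's difflib so that it runs without fuzzywuzzy installed. The function name and the `broadcast` and `per_row` column names are made up for illustration, and the scores may not match fuzz.ratio exactly.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> int:
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


df1 = pd.DataFrame({
    'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
    'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE'],
})

# A single call with two scalars returns a single number. Assigning a
# scalar to a column copies it into every row.
single_score = similarity(df1['name_city'].iloc[-1], df1['city'].iloc[-1])
df1['broadcast'] = single_score

# One call per row gives one score per row.
df1['per_row'] = df1.apply(
    lambda r: similarity(r['name_city'], r['city']), axis=1
)

print(df1)
```

The `broadcast` column holds one repeated value, while `per_row` varies from row to row, which is the behaviour the question was after.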

CodePudding user response:

df1['value_fuzzy'] = df1.apply(lambda row : fuzz.ratio(row['name_city'],
                     row['city']), axis = 1)

CodePudding user response:

Another way to do it, with a list comprehension:

df1['value_fuzzy'] = [fuzz.ratio(df1['name_city'][i],df1['city'][i]) for i in range(len(df1.index))]

Output:

df1

   id        name_city            city  value_fuzzy
0   0   RIO DE JANEIRO     RIO JANEIRO           88
1   1        SAO PAULO      SAOO PAULO           95
2   2              ITU   FLORIANOPOLIS           12
3   3         CURITIBA  BELO HORIZONTE           27   
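
A possible variation on the list comprehension, sketched under some assumptions: iterating over both columns with zip avoids the label lookup `df1['name_city'][i]`, which only works when the index is the default 0..n-1. The `similarity` function is a hypothetical difflib-based stand-in for fuzz.ratio so the snippet runs without fuzzywuzzy, and the index labels 10, 20, 30, 40 are made up to show the non-default case.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> int:
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Made-up, non-default index labels: df1['name_city'][0] would raise here.
df1 = pd.DataFrame(
    {
        'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
        'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE'],
    },
    index=[10, 20, 30, 40],
)

# zip pairs the values positionally, so the index labels do not matter.
df1['value_fuzzy'] = [
    similarity(a, b) for a, b in zip(df1['name_city'], df1['city'])
]

print(df1)
```

With fuzzywuzzy installed, swapping `similarity` for `fuzz.ratio` in the comprehension should be the only change needed.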