Home > Back-end >  Fast way to compare two columns containing country names written differently
Fast way to compare two columns containing country names written differently

Time:10-12

A solution in Pandas or Spark is welcomed either way, I am interested in the logic.

My dataframes:

df_1=
col_1    col_2    country
65783    75838    UNITED STATES
57637    83758    UNITED KINGDOM
73456    25356    KOREA, REP. OF
48577    23589    GHANA
48575    24389    SURINAME

df_2 =
col_1    col_2    country
65783    75838    United States of America
57637    83758    England
73456    25356    South Korea
48577    23589    Ghana
48575    24389    England

General code to compare such dataframes (it works):

import pandas as pd

def matching(df_1, df_2):
    df_new = df_2.merge(df_1, on=['col_1', 'col_2'], suffixes=(None, '_actual')).query('country != country_actual')
    return df_new

Obviously, only the last row is a mis-match, but given the fact that they are written based on a different convention and in reality I have hundreds of countries, how can I somehow bring them together so that I make a sensible comparison? I know how to change the value one by one, but it is impossible for hundreds of them.

CodePudding user response:

I don't know if there is a easy way, but the country_converter library can help you. It don't reconize England, but you can change manually the errors:

import country_converter as coco
some_names = ['United States of America', 'UNITED KINGDOM', 'South Korea', 'Ghana', 'SURINAME',
    'KOREA, REP. OF', 'UNITED STATES', 'GHANA']
standard_names = coco.convert(names=some_names, to='name_short')
print(standard_names)

CodePudding user response:

Have you tried fuzz matching? I have had a similar problem, I came up with this. https://github.com/hansalemaos/a_pandas_ex_fuzz/blob/main/__init__.py

It worked for me.

  • Related