A solution in Pandas or Spark is welcome either way; I am mainly interested in the logic.
My dataframes:
df_1=
col_1 col_2 country
65783 75838 UNITED STATES
57637 83758 UNITED KINGDOM
73456 25356 KOREA, REP. OF
48577 23589 GHANA
48575 24389 SURINAME
df_2 =
col_1 col_2 country
65783 75838 United States of America
57637 83758 England
73456 25356 South Korea
48577 23589 Ghana
48575 24389 England
General code to compare such dataframes (it works):
import pandas as pd

def matching(df_1, df_2):
    df_new = df_2.merge(df_1, on=['col_1', 'col_2'],
                        suffixes=(None, '_actual')).query('country != country_actual')
    return df_new
Obviously, only the last row is a mismatch, but the two frames follow different naming conventions, and in reality I have hundreds of countries. How can I bring the names together so that the comparison is sensible? I know how to change the values one by one, but that is impossible at this scale.
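To make the problem concrete, here is the function above run on the sample frames (rebuilt from the data shown). Every row gets flagged, because the raw strings never match even when the countries are the same:

```python
import pandas as pd

df_1 = pd.DataFrame({'col_1': [65783, 57637, 73456, 48577, 48575],
                     'col_2': [75838, 83758, 25356, 23589, 24389],
                     'country': ['UNITED STATES', 'UNITED KINGDOM',
                                 'KOREA, REP. OF', 'GHANA', 'SURINAME']})
df_2 = pd.DataFrame({'col_1': [65783, 57637, 73456, 48577, 48575],
                     'col_2': [75838, 83758, 25356, 23589, 24389],
                     'country': ['United States of America', 'England',
                                 'South Korea', 'Ghana', 'England']})

# Naive string comparison: 'UNITED STATES' != 'United States of America',
# so all 5 rows are reported as mismatches, not just the last one.
mismatches = df_2.merge(df_1, on=['col_1', 'col_2'],
                        suffixes=(None, '_actual')).query('country != country_actual')
print(len(mismatches))
```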
CodePudding user response:
I don't know if there is an easy way, but the country_converter library can help you. It doesn't recognize England, but you can fix such cases manually:
import country_converter as coco
some_names = ['United States of America', 'UNITED KINGDOM', 'South Korea', 'Ghana', 'SURINAME',
'KOREA, REP. OF', 'UNITED STATES', 'GHANA']
standard_names = coco.convert(names=some_names, to='name_short')
print(standard_names)
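Whichever normalizer you use, the overall pattern is the same: map both country columns to one canonical spelling, then merge and compare. A minimal sketch with a hand-built mapping (the dictionary entries are illustrative; in practice you would generate most of them with country_converter and only hand-patch the names it cannot resolve, such as England):

```python
import pandas as pd

# Illustrative canonical map covering the sample data; extend or generate as needed.
canonical = {
    'UNITED STATES': 'United States', 'United States of America': 'United States',
    'UNITED KINGDOM': 'United Kingdom', 'England': 'United Kingdom',
    'KOREA, REP. OF': 'South Korea', 'South Korea': 'South Korea',
    'GHANA': 'Ghana', 'Ghana': 'Ghana',
    'SURINAME': 'Suriname',
}

def matching(df_1, df_2):
    # Normalize both sides before merging so only real mismatches survive the query.
    left = df_2.assign(country=df_2['country'].map(canonical))
    right = df_1.assign(country=df_1['country'].map(canonical))
    return left.merge(right, on=['col_1', 'col_2'],
                      suffixes=(None, '_actual')).query('country != country_actual')
```

On the sample frames this reports only the genuine mismatch (England vs SURINAME); the four rows that merely differ in naming convention collapse to the same canonical name and drop out.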
CodePudding user response:
Have you tried fuzzy matching? I ran into a similar problem and came up with this: https://github.com/hansalemaos/a_pandas_ex_fuzz/blob/main/__init__.py
It worked for me.
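For a dependency-free illustration of the idea, the standard library's difflib can do basic fuzzy matching (this is a generic sketch of the technique, not the linked library's API). Note that a fixed similarity cutoff handles some variants well and misses others, which is why abbreviations like "KOREA, REP. OF" usually still need a manual override:

```python
import difflib

def fuzzy_canonical(name, candidates, cutoff=0.6):
    """Return the closest candidate name (case-insensitive),
    or None if nothing clears the similarity cutoff."""
    hits = difflib.get_close_matches(name.lower(),
                                     [c.lower() for c in candidates],
                                     n=1, cutoff=cutoff)
    return hits[0] if hits else None

candidates = ['United States of America', 'England', 'South Korea', 'Ghana', 'Suriname']
print(fuzzy_canonical('UNITED STATES', candidates))   # close enough: matches the US entry
print(fuzzy_canonical('KOREA, REP. OF', candidates))  # too dissimilar at this cutoff: None
```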