Home > Enterprise >  Compare two dataframe in pandas and edit results
Compare two dataframe in pandas and edit results

Time:04-11

I have two dataframe, df1 and df2,, df1 contains correct data that will be used to match data in df2

I want to find latitudes and longitudes in df2 that don't match the City name in df1.

Also I want to find cities in df2 that are "located" in the wrong country

Here's df1 dataframe

id city lat lng country

1036323110 Katherine -14.4667 132.2667 Australia

1840015979 South Pasadena 27.7526 -82.7394 United States

1124755118 Beaconsfield 45.4333 -73.8667 Canada

1250921305 Ferney-Voltaire 46.2558 6.1081 France

1156346497 Jiangshan 28.7412 118.6225 China

1231393325 Dīla 6.4104 38.3100 Ethiopia

1192391794 Gibara 21.1072 -76.1367 Cuba

1840054954 Hampstead 42.8821 -71.1709 United States

1840005111 West Islip 40.7097 -73.2971 United States

1076327352 Paulínia -22.7611 -47.1542 Brazil

Here's df2 dataframe

id location city country

16620625-5686 45.5333, -73.2833 Saint-Basile-le-Grand Canada

16310427-5502 52.0000, 84.9833 Belokurikha Russia

16501010-4957 -14.4667, 136.2667 Katherine Australia

16110430-8679 40.5626, -74.5743 Finderne United States

16990624-4174 27.7526, -90.7394 South Pasadena China

16790311-9092 35.98157, -160.41182 Jiangshan United States

16650927-9151 44.7667, 39.8667 West Islip Russia

16530328-2221 -22.8858, -48.4450 Botucatu Brazil

16411229-7314 42.8821, -71.1709 Hampstead United States

16060229-4175 -7.7296, 38.9500 Kibiti Tanzania

Here' my code so far:

city_df = pd.merge(df1,df2,on ='city',how ='left')

CodePudding user response:

First add lat and lng columns to df2

df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)

Then merge df1 to df2 based on cities

city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how ='right', suffixes=('_correct', ''))

Find cities in df2 that are "located" in the wrong country

m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])

   lat_correct  lng_correct            city country_correct             id              location        country       lat        lng
4      27.7526     -82.7394  South Pasadena   United States  16990624-4174     27.7526, -90.7394          China  27.75260  -90.73940
5      28.7412     118.6225       Jiangshan           China  16790311-9092  35.98157, -160.41182  United States  35.98157 -160.41182
6      40.7097     -73.2971      West Islip   United States  16650927-9151      44.7667, 39.8667         Russia  44.76670   39.86670

CodePudding user response:

To compare the two data frames, it's easier to have in the first place the df1 and df2 in similar formats. For example, df1 would be like this :

                     lat       lng        country
city                                             
Katherine       -14.4667  132.2667      Australia
South Pasadena   27.7526  -82.7394  United States
Beaconsfield     45.4333  -73.8667         Canada
Ferney-Voltaire  46.2558    6.1081         France
Jiangshan        28.7412  118.6225          China
Dīla              6.4104   38.3100       Ethiopia
Gibara           21.1072  -76.1367           Cuba
Hampstead        42.8821  -71.1709  United States
West Islip       40.7097  -73.2971  United States
Paulínia        -22.7611  -47.1542         Brazil

And df2 :

                            country2         lng2      lat2
city                                                       
Saint-Basile-le-Grand         Canada     -73.2833   45.5333
Belokurikha                   Russia      84.9833   52.0000
Katherine                  Australia     132.2667  -14.4667
Finderne               United States     -74.5743   40.5626
South Pasadena         United States     -82.7394   27.7526
West Islip             United States   -160.41182  35.98157
Belorechensk                  Russia      39.8667   44.7667
Botucatu                      Brazil     -48.4450  -22.8858
Hampstead              United States     -71.1709   42.8821
Kibiti                      Tanzania      38.9500   -7.7296

Then you can use the pd.concat method on axis=1 as follow :

df3 = pd.concat([df1,df2],axis=1) in order to get the following df :

                           lat       lng        country       country2         lng2      lat2
city                                                                                         
Katherine             -14.4667  132.2667      Australia      Australia     132.2667  -14.4667
South Pasadena         27.7526  -82.7394  United States  United States     -82.7394   27.7526
Beaconsfield           45.4333  -73.8667         Canada            NaN          NaN       NaN
Ferney-Voltaire        46.2558    6.1081         France            NaN          NaN       NaN
Jiangshan              28.7412  118.6225          China            NaN          NaN       NaN
Dīla                    6.4104   38.3100       Ethiopia            NaN          NaN       NaN
Gibara                 21.1072  -76.1367           Cuba            NaN          NaN       NaN
Hampstead              42.8821  -71.1709  United States  United States     -71.1709   42.8821
West Islip             40.7097  -73.2971  United States  United States   -160.41182  35.98157
Paulínia              -22.7611  -47.1542         Brazil            NaN          NaN       NaN
Saint-Basile-le-Grand      NaN       NaN            NaN         Canada     -73.2833   45.5333
Belokurikha                NaN       NaN            NaN         Russia      84.9833   52.0000
Finderne                   NaN       NaN            NaN  United States     -74.5743   40.5626
Belorechensk               NaN       NaN            NaN         Russia      39.8667   44.7667
Botucatu                   NaN       NaN            NaN         Brazil     -48.4450  -22.8858
Kibiti                     NaN       NaN            NaN       Tanzania      38.9500   -7.7296

Finaly from the concatenated df3 you can get rows where latitudes and longitudes in df2 don't match the City name in df1 :

df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()

                    lat      lng        country       country2         lng2      lat2
city                                                                             
West Islip  40.7097 -73.2971  United States  United States   -160.41182  35.98157

To find cities in df2 that are "located" in the wrong country :

 df3[df3['country']!=df3['country2']]
  • Related