I have two dataframe, df1 and df2,, df1 contains correct data that will be used to match data in df2
I want to find latitudes and longitudes in df2 that don't match the City name in df1.
Also I want to find cities in df2 that are "located" in the wrong country
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394 United States
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
id location city country
16620625-5686 45.5333, -73.2833 Saint-Basile-le-Grand Canada
16310427-5502 52.0000, 84.9833 Belokurikha Russia
16501010-4957 -14.4667, 136.2667 Katherine Australia
16110430-8679 40.5626, -74.5743 Finderne United States
16990624-4174 27.7526, -90.7394 South Pasadena China
16790311-9092 35.98157, -160.41182 Jiangshan United States
16650927-9151 44.7667, 39.8667 West Islip Russia
16530328-2221 -22.8858, -48.4450 Botucatu Brazil
16411229-7314 42.8821, -71.1709 Hampstead United States
16060229-4175 -7.7296, 38.9500 Kibiti Tanzania
Here' my code so far:
city_df = pd.merge(df1,df2,on ='city',how ='left')
CodePudding user response:
First add lat
and lng
columns to df2
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)
Then merge df1
to df2
based on cities
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how ='right', suffixes=('_correct', ''))
Find cities in df2 that are "located" in the wrong country
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct lng_correct city country_correct id location country lat lng
4 27.7526 -82.7394 South Pasadena United States 16990624-4174 27.7526, -90.7394 China 27.75260 -90.73940
5 28.7412 118.6225 Jiangshan China 16790311-9092 35.98157, -160.41182 United States 35.98157 -160.41182
6 40.7097 -73.2971 West Islip United States 16650927-9151 44.7667, 39.8667 Russia 44.76670 39.86670
CodePudding user response:
To compare the two data frames, it's easier to have in the first place the df1 and df2 in similar formats. For example, df1 would be like this :
lat lng country
city
Katherine -14.4667 132.2667 Australia
South Pasadena 27.7526 -82.7394 United States
Beaconsfield 45.4333 -73.8667 Canada
Ferney-Voltaire 46.2558 6.1081 France
Jiangshan 28.7412 118.6225 China
Dīla 6.4104 38.3100 Ethiopia
Gibara 21.1072 -76.1367 Cuba
Hampstead 42.8821 -71.1709 United States
West Islip 40.7097 -73.2971 United States
Paulínia -22.7611 -47.1542 Brazil
And df2 :
country2 lng2 lat2
city
Saint-Basile-le-Grand Canada -73.2833 45.5333
Belokurikha Russia 84.9833 52.0000
Katherine Australia 132.2667 -14.4667
Finderne United States -74.5743 40.5626
South Pasadena United States -82.7394 27.7526
West Islip United States -160.41182 35.98157
Belorechensk Russia 39.8667 44.7667
Botucatu Brazil -48.4450 -22.8858
Hampstead United States -71.1709 42.8821
Kibiti Tanzania 38.9500 -7.7296
Then you can use the pd.concat
method on axis=1
as follow :
df3 = pd.concat([df1,df2],axis=1)
in order to get the following df :
lat lng country country2 lng2 lat2
city
Katherine -14.4667 132.2667 Australia Australia 132.2667 -14.4667
South Pasadena 27.7526 -82.7394 United States United States -82.7394 27.7526
Beaconsfield 45.4333 -73.8667 Canada NaN NaN NaN
Ferney-Voltaire 46.2558 6.1081 France NaN NaN NaN
Jiangshan 28.7412 118.6225 China NaN NaN NaN
Dīla 6.4104 38.3100 Ethiopia NaN NaN NaN
Gibara 21.1072 -76.1367 Cuba NaN NaN NaN
Hampstead 42.8821 -71.1709 United States United States -71.1709 42.8821
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
Paulínia -22.7611 -47.1542 Brazil NaN NaN NaN
Saint-Basile-le-Grand NaN NaN NaN Canada -73.2833 45.5333
Belokurikha NaN NaN NaN Russia 84.9833 52.0000
Finderne NaN NaN NaN United States -74.5743 40.5626
Belorechensk NaN NaN NaN Russia 39.8667 44.7667
Botucatu NaN NaN NaN Brazil -48.4450 -22.8858
Kibiti NaN NaN NaN Tanzania 38.9500 -7.7296
Finaly from the concatenated df3 you can get rows where latitudes and longitudes in df2 don't match the City name in df1 :
df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()
lat lng country country2 lng2 lat2
city
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
To find cities in df2 that are "located" in the wrong country :
df3[df3['country']!=df3['country2']]