So I have two dataframes:
gp_df = CCA3 Country/Territory year GDP_USD
2662 AFG Afghanistan 1970 1.748887e 09
2661 AFE Africa Eastern and Southern 1970 4.486261e 10
2663 AFW Africa Western and Central 1970 2.350461e 10
2665 ALB Albania 1970 NaN
2720 DZA Algeria 1970 4.863487e 09
... ... ... ... ...
16156 PSE West Bank and Gaza 2020 1.553170e 10
16219 WLD World 2020 8.490680e 13
16222 YEM Yemen, Rep. 2020 1.884051e 10
16224 ZMB Zambia 2020 1.811063e 10
16225 ZWE Zimbabwe 2020 1.805117e 10
pp_df = CCA3 Country/Territory Continent 2020 Population 1970 Population
0 AFG Afghanistan Asia 38972230 10752971
1 ALB Albania Europe 2866849 2324731
2 DZA Algeria Africa 43451666 13795915
3 ASM American Samoa Oceania 46189 27075
4 AND Andorra Europe 77700 19860
.. ... ... ... ... ...
229 WLF Wallis and Futuna Oceania 11655 9377
230 ESH Western Sahara Africa 556048 76371
231 YEM Yemen Asia 32284046 6843607
232 ZMB Zambia Africa 18927715 4281671
233 ZWE Zimbabwe Africa 15669666 5202918
I want to find a way to remove all the Countrys/Territorys from gp_df that aren't in pp_df. I've already tried using .drop() and np.where() to try and locate the duplicates then drop them but I can't seem to get the syntax correct.
CodePudding user response:
Does this work...
gp_df[~gp_df['Country/Territory'].isin(pp_df['Country/Territory'])]
CodePudding user response:
Dataframes to work on
gp_df = pd.DataFrame({'CCA3': ["AFG" , "AFE", "AFW", "ALB", "DZA"],
'Country/Territory': ["Afghanistan" , "Africa Eastern and Southern",
"Africa Western and Central", "Albania", "Algeria"],
'year': [1970, 1970, 1970, 1970, 1970],
'GDP_USD':[1.748887e 09, 4.486261e 10, 2.350461e 10, pd.NA, 4.863487e 09],})
# CCA3 Country/Territory year GDP_USD
# 0 AFG Afghanistan 1970 1748887000.0
# 1 AFE Africa Eastern and Southern 1970 44862610000.0
# 2 AFW Africa Western and Central 1970 23504610000.0
# 3 ALB Albania 1970 <NA>
# 4 DZA Algeria 1970 4863487000.0
pp_df = pd.DataFrame({'CCA3': ["AFG" , "ALB", "DZA", "ASM", "AND"],
'Country/Territory': ["Afghanistan" , "Albania",
"Algeria", "American Samoa", "Andorra"],
'Continent': ["Asia", "Europe", "Africa", "Oceania", "Europe"],
'2020 Population':[38972230, 2866849, 43451666, 46189, 77700],
'1970 Population':[10752971, 2324731, 13795915, 27075, 19860],
})
# CCA3 Country/Territory Continent 2020 Population 1970 Population
# 0 AFG Afghanistan Asia 38972230 10752971
# 1 ALB Albania Europe 2866849 2324731
# 2 DZA Algeria Africa 43451666 13795915
# 3 ASM American Samoa Oceania 46189 27075
# 4 AND Andorra Europe 77700 19860
Complete script for checkings
import pandas as pd
gp_df = pd.DataFrame({'CCA3': ["AFG" , "AFE", "AFW", "ALB", "DZA"],
'Country/Territory': ["Afghanistan" , "Africa Eastern and Southern",
"Africa Western and Central", "Albania", "Algeria"],
'year': [1970, 1970, 1970, 1970, 1970],
'GDP_USD':[1.748887e 09, 4.486261e 10, 2.350461e 10, pd.NA, 4.863487e 09],})
pp_df = pd.DataFrame({'CCA3': ["AFG" , "ALB", "DZA", "ASM", "AND"],
'Country/Territory': ["Afghanistan" , "Albania",
"Algeria", "American Samoa", "Andorra"],
'Continent': ["Asia", "Europe", "Africa", "Oceania", "Europe"],
'2020 Population':[38972230, 2866849, 43451666, 46189, 77700],
'1970 Population':[10752971, 2324731, 13795915, 27075, 19860],
})
uKeys_1 = pp_df['Country/Territory'].unique()
r = gp_df[gp_df['Country/Territory'].isin(uKeys_1)]
print(r)
Result
# CCA3 Country/Territory year GDP_USD
# 0 AFG Afghanistan 1970 1748887000.0
# 3 ALB Albania 1970 <NA>
# 4 DZA Algeria 1970 4863487000.0