Home > front end >  How to remove rows from dataframe where data from another dataframe DOESN'T match
How to remove rows from dataframe where data from another dataframe DOESN'T match

Time:01-29

So I have two dataframes:


gp_df = CCA3            Country/Territory  year       GDP_USD
2662   AFG                  Afghanistan  1970  1.748887e 09
2661   AFE  Africa Eastern and Southern  1970  4.486261e 10
2663   AFW   Africa Western and Central  1970  2.350461e 10
2665   ALB                      Albania  1970           NaN
2720   DZA                      Algeria  1970  4.863487e 09
...    ...                          ...   ...           ...
16156  PSE           West Bank and Gaza  2020  1.553170e 10
16219  WLD                        World  2020  8.490680e 13
16222  YEM                  Yemen, Rep.  2020  1.884051e 10
16224  ZMB                       Zambia  2020  1.811063e 10
16225  ZWE                     Zimbabwe  2020  1.805117e 10


pp_df = CCA3  Country/Territory Continent  2020 Population  1970 Population 
0    AFG        Afghanistan      Asia         38972230         10752971   
1    ALB            Albania    Europe          2866849          2324731   
2    DZA            Algeria    Africa         43451666         13795915   
3    ASM     American Samoa   Oceania            46189            27075   
4    AND            Andorra    Europe            77700            19860   
..   ...                ...       ...              ...              ...   
229  WLF  Wallis and Futuna   Oceania            11655             9377   
230  ESH     Western Sahara    Africa           556048            76371   
231  YEM              Yemen      Asia         32284046          6843607   
232  ZMB             Zambia    Africa         18927715          4281671   
233  ZWE           Zimbabwe    Africa         15669666          5202918   

I want to find a way to remove all the Countrys/Territorys from gp_df that aren't in pp_df. I've already tried using .drop() and np.where() to try and locate the duplicates then drop them but I can't seem to get the syntax correct.

CodePudding user response:

Does this work...

gp_df[~gp_df['Country/Territory'].isin(pp_df['Country/Territory'])]

CodePudding user response:

Dataframes to work on

gp_df = pd.DataFrame({'CCA3': ["AFG" ,  "AFE",  "AFW", "ALB",  "DZA"], 
                    'Country/Territory': ["Afghanistan" ,  "Africa Eastern and Southern",  
                                           "Africa Western and Central", "Albania",  "Algeria"], 
                    'year': [1970, 1970, 1970, 1970, 1970], 
                    'GDP_USD':[1.748887e 09, 4.486261e 10, 2.350461e 10, pd.NA, 4.863487e 09],})

#   CCA3           Country/Territory   year        GDP_USD
# 0  AFG                  Afghanistan  1970   1748887000.0
# 1  AFE  Africa Eastern and Southern  1970  44862610000.0
# 2  AFW   Africa Western and Central  1970  23504610000.0
# 3  ALB                      Albania  1970           <NA>
# 4  DZA                      Algeria  1970   4863487000.0

pp_df = pd.DataFrame({'CCA3': ["AFG" ,  "ALB",  "DZA", "ASM",  "AND"], 
                    'Country/Territory': ["Afghanistan" ,  "Albania",  
                                           "Algeria", "American Samoa",  "Andorra"], 
                    'Continent': ["Asia", "Europe", "Africa", "Oceania", "Europe"], 
                    '2020 Population':[38972230, 2866849, 43451666, 46189, 77700],
                    '1970 Population':[10752971, 2324731, 13795915, 27075, 19860],
                    })

#   CCA3 Country/Territory Continent  2020 Population  1970 Population
# 0  AFG       Afghanistan      Asia         38972230         10752971
# 1  ALB           Albania    Europe          2866849          2324731
# 2  DZA           Algeria    Africa         43451666         13795915
# 3  ASM    American Samoa   Oceania            46189            27075
# 4  AND           Andorra    Europe            77700            19860

Complete script for checkings

import pandas as pd

gp_df = pd.DataFrame({'CCA3': ["AFG" ,  "AFE",  "AFW", "ALB",  "DZA"], 
                    'Country/Territory': ["Afghanistan" ,  "Africa Eastern and Southern",  
                                           "Africa Western and Central", "Albania",  "Algeria"], 
                    'year': [1970, 1970, 1970, 1970, 1970], 
                    'GDP_USD':[1.748887e 09, 4.486261e 10, 2.350461e 10, pd.NA, 4.863487e 09],})

pp_df = pd.DataFrame({'CCA3': ["AFG" ,  "ALB",  "DZA", "ASM",  "AND"], 
                    'Country/Territory': ["Afghanistan" ,  "Albania",  
                                           "Algeria", "American Samoa",  "Andorra"], 
                    'Continent': ["Asia", "Europe", "Africa", "Oceania", "Europe"], 
                    '2020 Population':[38972230, 2866849, 43451666, 46189, 77700],
                    '1970 Population':[10752971, 2324731, 13795915, 27075, 19860],
                    })

uKeys_1 = pp_df['Country/Territory'].unique()

r = gp_df[gp_df['Country/Territory'].isin(uKeys_1)]

print(r)

Result

#   CCA3 Country/Territory  year       GDP_USD
# 0  AFG       Afghanistan  1970  1748887000.0
# 3  ALB           Albania  1970          <NA>
# 4  DZA           Algeria  1970  4863487000.0
  • Related