Hy everyone. I have a huge dataset in which I have several nations indicated by ISO-codes. Anyway, there are some nations that are shown with their official name but not with ISO code. I want to find them and then replace them with respective iso-codes.
This is the example of the df I have:
| TERRITORY |
-----------------------
| IT, GB, USA, France |
| ES, Russia, Germany, PT |
| EG, LY, DZ |
Expected output:
'The nations that were not converted are:' France, Russia, Germany
The big problem is that these nations are in the same cells and are considered as a single value. I thought to ask the program to print just that substrings which are longer than two characters, but after different attempts, I didn't get anything.
Could someone help me?
CodePudding user response:
IIUC, you could split
explode
and map to a known list of codes (here using pycountry
):
import pycountry
codes = {c.alpha_2 for c in pycountry.countries}
# or manually set
# codes = {'IT', 'GB', 'USA', 'FR'...}
s = df['TERRITORY'].str.split(', ').explode().drop_duplicates()
print(f'The nations that were not converted are: {", ".join(s[~s.isin(codes)])}')
output:
The nations that were not converted are: USA, France, Russia, Germany