Home > Back-end >  How to print just selected substrings (contained in a dataframe column) setting conditions with pand
How to print just selected substrings (contained in a dataframe column) setting conditions with pand

Time:03-23

Hy everyone. I have a huge dataset in which I have several nations indicated by ISO-codes. Anyway, there are some nations that are shown with their official name but not with ISO code. I want to find them and then replace them with respective iso-codes.

This is the example of the df I have:

| TERRITORY               |
 -----------------------
| IT, GB, USA, France     |
| ES, Russia, Germany, PT |
| EG, LY, DZ              |

Expected output:

'The nations that were not converted are:' France, Russia, Germany

The big problem is that these nations are in the same cells and are considered as a single value. I thought to ask the program to print just that substrings which are longer than two characters, but after different attempts, I didn't get anything.

Could someone help me?

CodePudding user response:

IIUC, you could split explode and map to a known list of codes (here using pycountry):

import pycountry
codes = {c.alpha_2 for c in pycountry.countries}
# or manually set
# codes = {'IT', 'GB', 'USA', 'FR'...}

s = df['TERRITORY'].str.split(', ').explode().drop_duplicates()
print(f'The nations that were not converted are: {", ".join(s[~s.isin(codes)])}')

output:

The nations that were not converted are: USA, France, Russia, Germany
  • Related