Get country name from dataframe column by comparing with a list-CodePudding

How to get country name from dataframe column by comparing with a list of string which contain country names?

For example:

list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create dataframe column name is job_location of employee
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })
df 
job_location
0   birmingham, england, united kingdom
1   new jersey, united states
2   gilgit-baltistan, pakistan
3   uae
4   united states
5   pakistan
6   31-c2, gulberg 3, lahore, pakistan

I need a new column in dataframe name as country which contain country name from job_location column.

CodePudding user response：

Not assuming that the country will always be at the end, here is something that should work:

import pandas as pd

country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create dataframe column name is job_location of employee
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })

matching_countries = []

for key, value in df.items():
    for text in value:
        for country in country_list:
                if country in text:
                    matching_countries.append(country)

df['country'] = matching_countries

print (df)

Outputs:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

CodePudding user response：

Using clist as the list name, you can craft a regex and use str.extract:

reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

But honestly, if the job_location is always nicely formatted with the country as the end, it's probably easier to split on comma and keep the last field

CodePudding user response：

First of all, change your list name. I have done it using list comprehension..

df['country'] = [x.split(",")[-1] for x in df['job_location']]

Output:

	job_location	country
0	birmingham, england, united kingdom	united kingdom
1	new jersey, united states	united states
2	gilgit-baltistan, pakistan	pakistan
3	uae	uae
4	united states	united states
5	pakistan	pakistan
6	31-c2, gulberg 3, lahore, pakistan	pakistan