Home > database >  Get country name from dataframe column by comparing with a list
Get country name from dataframe column by comparing with a list

Time:12-08

How to get country name from dataframe column by comparing with a list of string which contain country names?

For example:

list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create dataframe column name is job_location of employee
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })
df 
job_location
0   birmingham, england, united kingdom
1   new jersey, united states
2   gilgit-baltistan, pakistan
3   uae
4   united states
5   pakistan
6   31-c2, gulberg 3, lahore, pakistan

I need a new column in dataframe name as country which contain country name from job_location column.

CodePudding user response:

Not assuming that the country will always be at the end, here is something that should work:

import pandas as pd

country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create dataframe column name is job_location of employee
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })

matching_countries = []

for key, value in df.items():
    for text in value:
        for country in country_list:
                if country in text:
                    matching_countries.append(country)

df['country'] = matching_countries

print (df)

Outputs:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

CodePudding user response:

Using clist as the list name, you can craft a regex and use str.extract:

reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

But honestly, if the job_location is always nicely formatted with the country as the end, it's probably easier to split on comma and keep the last field

CodePudding user response:

First of all, change your list name. I have done it using list comprehension..

df['country'] = [x.split(",")[-1] for x in df['job_location']]

Output:

job_location country
0 birmingham, england, united kingdom united kingdom
1 new jersey, united states united states
2 gilgit-baltistan, pakistan pakistan
3 uae uae
4 united states united states
5 pakistan pakistan
6 31-c2, gulberg 3, lahore, pakistan pakistan
  • Related