How to get country name from dataframe column by comparing with a list of string which contain country names?
For example:
list = ["pakistan","united kingdom","uk","usa","united states","uae"]
# create dataframe column name is job_location of employee
df = pd.DataFrame({
'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
})
df
job_location
0 birmingham, england, united kingdom
1 new jersey, united states
2 gilgit-baltistan, pakistan
3 uae
4 united states
5 pakistan
6 31-c2, gulberg 3, lahore, pakistan
I need a new column in dataframe name as country which contain country name from job_location column.
CodePudding user response:
Not assuming that the country will always be at the end, here is something that should work:
import pandas as pd
country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]
# create dataframe column name is job_location of employee
df = pd.DataFrame({
'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
})
matching_countries = []
for key, value in df.items():
for text in value:
for country in country_list:
if country in text:
matching_countries.append(country)
df['country'] = matching_countries
print (df)
Outputs:
job_location country
0 birmingham, england, united kingdom united kingdom
1 new jersey, united states united states
2 gilgit-baltistan, pakistan pakistan
3 uae uae
4 united states united states
5 pakistan pakistan
6 31-c2, gulberg 3, lahore, pakistan pakistan
CodePudding user response:
Using clist
as the list name, you can craft a regex and use str.extract
:
reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)
Output:
job_location country
0 birmingham, england, united kingdom united kingdom
1 new jersey, united states united states
2 gilgit-baltistan, pakistan pakistan
3 uae uae
4 united states united states
5 pakistan pakistan
6 31-c2, gulberg 3, lahore, pakistan pakistan
But honestly, if the job_location is always nicely formatted with the country as the end, it's probably easier to split on comma and keep the last field
CodePudding user response:
First of all, change your list name. I have done it using list comprehension..
df['country'] = [x.split(",")[-1] for x in df['job_location']]
Output:
job_location | country | |
---|---|---|
0 | birmingham, england, united kingdom | united kingdom |
1 | new jersey, united states | united states |
2 | gilgit-baltistan, pakistan | pakistan |
3 | uae | uae |
4 | united states | united states |
5 | pakistan | pakistan |
6 | 31-c2, gulberg 3, lahore, pakistan | pakistan |