This is not the best approach but this what I did so far:
I have this example df:
df = pd.DataFrame({
'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
})
df
gives:
City
0 I lived Los Angeles
1 I visited London and Toronto
2 the best one is Toronto
3 business hub is in New York
4 Mexico city is stunning
I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.
So, here is my nested dic as a reference for countries and cities:
country = { 'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:
def get_country(x):
count = 0
for k,v in country.items():
for y in v:
if y.lower() in x:
df[k] = count 1
else:
return None
then applied it to df:
df.City.apply(lambda x: get_country(x.lower()))
I got the following output:
City US
0 I lived Los Angeles 1
1 I visited London and Toronto 1
2 the best one is Toronto 1
3 business hub is in New York 1
4 Mexico city is stunning 1
Expected output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
CodePudding user response:
You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country
:
cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
CodePudding user response:
Here is a solution based on your function. I changed the name of the variables to be more readable and easy to follow.
df = pd.DataFrame({
'City': ['I lived Los Angeles',
'I visited London and Toronto',
'the best one is Toronto',
'business hub is in New York',
' Mexico city is stunning']
})
country_cities = {
'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
def get_country(text):
text = text.lower()
count = 0
country_counts = dict.fromkeys(country_cities, 0)
for country, cities in country_cities.items():
for city in cities:
if city.lower() in text:
country_counts[country] = 1
return pd.Series(country_counts)
df = df.join(df.City.apply(get_country))
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Why your solution doesn't work?
if y.lower() in x: df[k] = count 1 else: return None
The reason your function doesn't produce the right output is that
you are returning None
if a city is not found in the text: the remaining countries and cities are not checked, because it automatically exits the function.
What is happening is that only US
cities are checked, and the line df[k] = 1
creates a entire column named k
filled with the value 1. It's not creating a single value for that row, it creates or modifies the full column. When using apply
you want to change a single row or value (the input of function), so don't change directly the main DataFrame inside the function.