Keyword categorization from strings in a new column in pandas-CodePudding

This is not the best approach but this what I did so far:

I have this example df:

df = pd.DataFrame({
                  'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
                   })

df

gives:

    City
0   I lived Los Angeles
1   I visited London and Toronto
2   the best one is Toronto
3   business hub is in New York
4   Mexico city is stunning

I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.

So, here is my nested dic as a reference for countries and cities:

country = { 'US': ['New York','Los Angeles','San Diego'],
         'CA': ['Montreal','Toronto','Manitoba'],
         'UK': ['London','Liverpool','Manchester']      
       }

and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:

def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count   1
            else:
                return None

then applied it to df:

df.City.apply(lambda x: get_country(x.lower()))

I got the following output:

    City                              US
0   I lived Los Angeles               1
1   I visited London and Toronto      1
2   the best one is Toronto           1
3   business hub is in New York       1
4   Mexico city is stunning           1

Expected output:

    City                              US    CA   UK
0   I lived Los Angeles               1      0    0
1   I visited London and Toronto      0      1    1
2   the best one is Toronto           0      1    0
3   business hub is in New York       1      0    0
4   Mexico city is stunning           0      0    0

CodePudding user response：

You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country:

cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)

Output:

                           City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0

CodePudding user response：

Here is a solution based on your function. I changed the name of the variables to be more readable and easy to follow.

df = pd.DataFrame({
    'City': ['I lived Los Angeles', 
             'I visited London and Toronto',
             'the best one is Toronto', 
             'business hub is in New York',
             ' Mexico city is stunning']
})


country_cities = { 
    'US': ['New York','Los Angeles','San Diego'],
    'CA': ['Montreal','Toronto','Manitoba'],
    'UK': ['London','Liverpool','Manchester']      
}

def get_country(text):
    text = text.lower()
    count = 0
    country_counts = dict.fromkeys(country_cities, 0)
    
    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country]  = 1 
                
    return pd.Series(country_counts)

df = df.join(df.City.apply(get_country))

Output:

                           City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0

Why your solution doesn't work?

if y.lower() in x:
               df[k] = count   1
           else:
               return None

The reason your function doesn't produce the right output is that you are returning None if a city is not found in the text: the remaining countries and cities are not checked, because it automatically exits the function.

What is happening is that only US cities are checked, and the line df[k] = 1 creates a entire column named k filled with the value 1. It's not creating a single value for that row, it creates or modifies the full column. When using apply you want to change a single row or value (the input of function), so don't change directly the main DataFrame inside the function.