pandas: use value_counts in an apply function-CodePudding

Here is a toy example of my pandas dataframe:

    country_market  language_market
0   United States   English
1   United States   French
2   Not used    Not used
3   Canada OR United States English
4   Germany English
5   United Kingdom  French
6   United States   German
7   United Kingdom  English
8   United Kingdom  English
9   Not used    Not used
10  United States   French
11  United States   English
12  United Kingdom  English
13  United States   French
14  Not used    English
15  Not used    English
16  United States   French
17  United States   Not used
18  Not used    English
19  United States   German

I want to add a column top_country that shows whether the value in country_market is one of the top two most commonly seen countries in the data. If it is, I want the new top_country column show the value in country_market and if not, then I want it to show "Other". I want to repeat this process forlanguage_market (and a whole load of other market columns I don't show here).

This is how I'd like the data to look after processing:

    country_market  language_market top_country top_language
0   United States   English United States   English
1   United States   French  United States   French
2   Not used    Not used    Not used    Other
3   Canada OR United States English Other   English
4   Germany English Other   English
5   United Kingdom  French  Other   French
6   United States   German  United States   Other
7   United Kingdom  English Other   English
8   United Kingdom  English Other   English
9   Not used    Not used    Not used    Other
10  United States   French  United States   French
11  United States   English United States   English
12  United Kingdom  English Other   English
13  United States   French  United States   French
14  Not used    English Not used    English
15  Not used    English Not used    English
16  United States   French  United States   French
17  United States   Not used    United States   Other
18  Not used    English Not used    English
19  United States   German  United States   Other

I made a function original_top_markets_function to do this, but I couldn't figure how to pass the value_counts part of my function to pandas apply. I kept getting AttributeError: 'str' object has no attribute 'value_counts'.

def original_top_markets_function(x):
top2 = x.value_counts().nlargest(2).index
for i in x:
    if i in top2: 
        return i
    else: 
        return 'Other'

I know this is because apply is looking at each element in my target column, but I also need the function to consider the whole column at once, so that I can use value_counts. I don't know how to do that.

So I have come up with this top_markets function as a solution, using a list, which does what I want, but isn't very efficient. I'll need to apply this function to lots of different market columns, so I'd like something more pythonic.

def top_markets(x):
top2 = x.value_counts().nlargest(2).index
results = []
for i in x:
    if i in top2: 
        results.append(i)
    else: 
        results.append('Other')         
return results

Here's a reproducible example. Please can somehow help me fix my top_markets function so I can use it with apply?

import pandas as pd

d = {0: {'country_market': 'United States', 'language_market': 'English'},
 1: {'country_market': 'United States', 'language_market': 'French'},
 2: {'country_market': 'Not used', 'language_market': 'Not used'},
 3: {'country_market': 'Canada OR United States',
  'language_market': 'English'},
 4: {'country_market': 'Germany', 'language_market': 'English'},
 5: {'country_market': 'United Kingdom', 'language_market': 'French'},
 6: {'country_market': 'United States', 'language_market': 'German'},
 7: {'country_market': 'United Kingdom', 'language_market': 'English'},
 8: {'country_market': 'United Kingdom', 'language_market': 'English'},
 9: {'country_market': 'Not used', 'language_market': 'Not used'},
 10: {'country_market': 'United States', 'language_market': 'French'},
 11: {'country_market': 'United States', 'language_market': 'English'},
 12: {'country_market': 'United Kingdom', 'language_market': 'English'},
 13: {'country_market': 'United States', 'language_market': 'French'},
 14: {'country_market': 'Not used', 'language_market': 'English'},
 15: {'country_market': 'Not used', 'language_market': 'English'},
 16: {'country_market': 'United States', 'language_market': 'French'},
 17: {'country_market': 'United States', 'language_market': 'Not used'},
 18: {'country_market': 'Not used', 'language_market': 'English'},
 19: {'country_market': 'United States', 'language_market': 'German'}}

df = pd.DataFrame.from_dict(d, orient='index')

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2: 
            results.append(i)
        else: 
            results.append('Other')         
    return results

df['top_country'] = top_markets(df['country_market'])
df['top_language'] = top_markets(df['language_market'])

df

CodePudding user response：

If need working by multiple columns by DataFrame.apply in some function, e.g. here lambda function use:

cols = ['language_market', 'country_market']

f = lambda x: np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(f).add_prefix('total_'))

Solution without lambda function:

def top_markets(x):
    return np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')

df = df.join(df[cols].apply(top_markets).add_prefix('total_'))

CodePudding user response：

I think u can just use:

df['top_country'] = np.where(df['country_market'].isin(df['country_market'].value_counts().nlargest(2).index), df['country_market'], 'Other')
df['top_language'] = np.where(df['language_market'].isin(df['language_market'].value_counts().nlargest(2).index), df['language_market'], 'Other')

If u wish to use your own function, you can use:

df['top_country'] = df[['country_market']].apply(top_markets)
df['top_language'] = df[['language_market']].apply(top_markets)

#OR
df[['top_country', 'top_language']] = df[['country_market', 'language_market']].apply(top_markets)

Edit as per discussion in comments:

def top_markets(x, top):
    if x in top:
        return x
    else:
        'Other'

top_country = df['country_market'].value_counts().nlargest(2).index
top_languages = df['language_market'].value_counts().nlargest(2).index

df['top_country'] = df['country_market'].apply(lambda x: top_markets(x, top_country))
df['top_language'] = df['language_market'].apply(lambda x: top_markets(x, top_languages))