Here is a toy example of my pandas dataframe:
country_market language_market
0 United States English
1 United States French
2 Not used Not used
3 Canada OR United States English
4 Germany English
5 United Kingdom French
6 United States German
7 United Kingdom English
8 United Kingdom English
9 Not used Not used
10 United States French
11 United States English
12 United Kingdom English
13 United States French
14 Not used English
15 Not used English
16 United States French
17 United States Not used
18 Not used English
19 United States German
I want to add a column top_country
that shows whether the value in country_market
is one of the top two most commonly seen countries in the data. If it is, I want the new top_country
column show the value in country_market
and if not, then I want it to show "Other". I want to repeat this process forlanguage_market
(and a whole load of other market columns I don't show here).
This is how I'd like the data to look after processing:
country_market language_market top_country top_language
0 United States English United States English
1 United States French United States French
2 Not used Not used Not used Other
3 Canada OR United States English Other English
4 Germany English Other English
5 United Kingdom French Other French
6 United States German United States Other
7 United Kingdom English Other English
8 United Kingdom English Other English
9 Not used Not used Not used Other
10 United States French United States French
11 United States English United States English
12 United Kingdom English Other English
13 United States French United States French
14 Not used English Not used English
15 Not used English Not used English
16 United States French United States French
17 United States Not used United States Other
18 Not used English Not used English
19 United States German United States Other
I made a function original_top_markets_function
to do this, but I couldn't figure how to pass the value_counts
part of my function to pandas apply
. I kept getting AttributeError: 'str' object has no attribute 'value_counts'
.
def original_top_markets_function(x):
top2 = x.value_counts().nlargest(2).index
for i in x:
if i in top2:
return i
else:
return 'Other'
I know this is because apply
is looking at each element in my target column, but I also need the function to consider the whole column at once, so that I can use value_counts
. I don't know how to do that.
So I have come up with this top_markets
function as a solution, using a list, which does what I want, but isn't very efficient. I'll need to apply this function to lots of different market columns, so I'd like something more pythonic.
def top_markets(x):
top2 = x.value_counts().nlargest(2).index
results = []
for i in x:
if i in top2:
results.append(i)
else:
results.append('Other')
return results
Here's a reproducible example. Please can somehow help me fix my top_markets
function so I can use it with apply
?
import pandas as pd
d = {0: {'country_market': 'United States', 'language_market': 'English'},
1: {'country_market': 'United States', 'language_market': 'French'},
2: {'country_market': 'Not used', 'language_market': 'Not used'},
3: {'country_market': 'Canada OR United States',
'language_market': 'English'},
4: {'country_market': 'Germany', 'language_market': 'English'},
5: {'country_market': 'United Kingdom', 'language_market': 'French'},
6: {'country_market': 'United States', 'language_market': 'German'},
7: {'country_market': 'United Kingdom', 'language_market': 'English'},
8: {'country_market': 'United Kingdom', 'language_market': 'English'},
9: {'country_market': 'Not used', 'language_market': 'Not used'},
10: {'country_market': 'United States', 'language_market': 'French'},
11: {'country_market': 'United States', 'language_market': 'English'},
12: {'country_market': 'United Kingdom', 'language_market': 'English'},
13: {'country_market': 'United States', 'language_market': 'French'},
14: {'country_market': 'Not used', 'language_market': 'English'},
15: {'country_market': 'Not used', 'language_market': 'English'},
16: {'country_market': 'United States', 'language_market': 'French'},
17: {'country_market': 'United States', 'language_market': 'Not used'},
18: {'country_market': 'Not used', 'language_market': 'English'},
19: {'country_market': 'United States', 'language_market': 'German'}}
df = pd.DataFrame.from_dict(d, orient='index')
def top_markets(x):
top2 = x.value_counts().nlargest(2).index
results = []
for i in x:
if i in top2:
results.append(i)
else:
results.append('Other')
return results
df['top_country'] = top_markets(df['country_market'])
df['top_language'] = top_markets(df['language_market'])
df
CodePudding user response:
If need working by multiple columns by DataFrame.apply
in some function, e.g. here lambda function
use:
cols = ['language_market', 'country_market']
f = lambda x: np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(f).add_prefix('total_'))
Solution without lambda function:
def top_markets(x):
return np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(top_markets).add_prefix('total_'))
CodePudding user response:
I think u can just use:
df['top_country'] = np.where(df['country_market'].isin(df['country_market'].value_counts().nlargest(2).index), df['country_market'], 'Other')
df['top_language'] = np.where(df['language_market'].isin(df['language_market'].value_counts().nlargest(2).index), df['language_market'], 'Other')
If u wish to use your own function, you can use:
df['top_country'] = df[['country_market']].apply(top_markets)
df['top_language'] = df[['language_market']].apply(top_markets)
#OR
df[['top_country', 'top_language']] = df[['country_market', 'language_market']].apply(top_markets)
Edit as per discussion in comments:
def top_markets(x, top):
if x in top:
return x
else:
'Other'
top_country = df['country_market'].value_counts().nlargest(2).index
top_languages = df['language_market'].value_counts().nlargest(2).index
df['top_country'] = df['country_market'].apply(lambda x: top_markets(x, top_country))
df['top_language'] = df['language_market'].apply(lambda x: top_markets(x, top_languages))