I have a table which includes salary
and company_location
.
I was trying to calculate the mean salary of a country, its works:
wage = df.groupby('company_location').mean()['salary']
However, I have many with company_location
which have less than 5 entries, I would like to exclude them from the report.
I know how to calculate countries with the top 5 entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having a problem connecting those to variables into one and making a graph out of it...
Thank you.
CodePudding user response:
You can remove rows whose value occurrence is below a threshold:
df = df[df.groupby('company_location')['company_location'].transform('size') > 5]
CodePudding user response:
You can do the following to only apply the groupby and aggregation to those with more than 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) > 5)
wage = df[mask].groupby('company_location')['salary'].mean()