I am doing the exercise from
https://github.com/guipsamora/pandas_exercises/blob/master/03_Grouping/Occupation/Exercise.ipynb
I tried to use the code below to solve question 5:
Step 5. Discover the Male ratio per occupation and sort it from the most to the least
users.groupby("occupation").gender.value_counts().loc[occupation]
However, it always shows this error:
>>> users.groupby("occupation").gender.value_counts().loc[occupation]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'occupation' is not defined
The reason is that multiindex is used instead of single indexing.
a = pd.DataFrame(users.groupby("occupation").gender.value_counts())
a.axes
>>> [MultiIndex([('administrator', 'M'),
('administrator', 'F'),
( 'artist', 'M'),
( 'artist', 'F'),
( 'doctor', 'M'),
( 'educator', 'M'),
( 'educator', 'F'),
( 'engineer', 'M'),
( 'engineer', 'F'),
('entertainment', 'M'),
('entertainment', 'F'),
( 'executive', 'M'),
( 'executive', 'F'),
( 'healthcare', 'F'),
( 'healthcare', 'M'),
( 'homemaker', 'F'),
( 'homemaker', 'M'),
( 'lawyer', 'M'),
( 'lawyer', 'F'),
( 'librarian', 'F'),
( 'librarian', 'M'),
( 'marketing', 'M'),
( 'marketing', 'F'),
( 'none', 'M'),
( 'none', 'F'),
( 'other', 'M'),
( 'other', 'F'),
( 'programmer', 'M'),
( 'programmer', 'F'),
( 'retired', 'M'),
( 'retired', 'F'),
( 'salesman', 'M'),
( 'salesman', 'F'),
( 'scientist', 'M'),
( 'scientist', 'F'),
( 'student', 'M'),
( 'student', 'F'),
( 'technician', 'M'),
( 'technician', 'F'),
( 'writer', 'M'),
( 'writer', 'F')],
names=['occupation', 'gender']), Index(['gender'], dtype='object')]
I want to sum the row by grouping gender and occupation, is there any way to do it without using forloop?
CodePudding user response:
You could try the following:
male_ratios = (
users
.groupby("occupation")["gender"].value_counts()
.unstack()
.assign(male_ratio=lambda df: df["M"] / df.sum(axis=1))
.sort_values("male_ratio", ascending=False)
)
With .unstack
you're pulling the gender
part of the multiindex into the columns. Then calculate the ratios and put them into a new column male_ratios
, and sort the dataframe in descending order along the new column.
Example dataframe:
from random import choice
occupations = ['administrator', 'artist', 'doctor', 'retired']
genders = ['F', 'M']
users = pd.DataFrame(
[(n, choice(occupations), choice(genders)) for n in range(100)],
columns=["id", "occupation", "gender"]
)
id occupation gender
0 0 retired F
1 1 administrator M
2 2 artist F
3 3 administrator F
4 4 administrator F
.. .. ... ...
95 95 artist M
96 96 doctor M
97 97 retired M
98 98 administrator F
99 99 administrator F
[100 rows x 3 columns]
Result of the code above:
gender F M male_ratio
occupation
doctor 8 13 0.619048
artist 12 13 0.520000
retired 17 16 0.484848
administrator 14 7 0.333333
You could do the same with .pivot_table
:
male_ratios = (
users
.pivot_table(index="occupation", columns="gender", aggfunc="count")
.droplevel(0, axis=1)
.assign(male_ratio=lambda df: df["M"] / df.sum(axis=1))
.sort_values("male_ratio", ascending=False)
)