Home > Software design >  How to sum the rows by using the name of multiindex in pandas?
How to sum the rows by using the name of multiindex in pandas?

Time:10-28

I am doing the exercise from

https://github.com/guipsamora/pandas_exercises/blob/master/03_Grouping/Occupation/Exercise.ipynb

I tried to use the code below to solve question 5:

Step 5. Discover the Male ratio per occupation and sort it from the most to the least

users.groupby("occupation").gender.value_counts().loc[occupation]

However, it always shows this error:

>>> users.groupby("occupation").gender.value_counts().loc[occupation]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'occupation' is not defined

The reason is that multiindex is used instead of single indexing.

a = pd.DataFrame(users.groupby("occupation").gender.value_counts())
a.axes
>>> [MultiIndex([('administrator', 'M'),
            ('administrator', 'F'),
            (       'artist', 'M'),
            (       'artist', 'F'),
            (       'doctor', 'M'),
            (     'educator', 'M'),
            (     'educator', 'F'),
            (     'engineer', 'M'),
            (     'engineer', 'F'),
            ('entertainment', 'M'),
            ('entertainment', 'F'),
            (    'executive', 'M'),
            (    'executive', 'F'),
            (   'healthcare', 'F'),
            (   'healthcare', 'M'),
            (    'homemaker', 'F'),
            (    'homemaker', 'M'),
            (       'lawyer', 'M'),
            (       'lawyer', 'F'),
            (    'librarian', 'F'),
            (    'librarian', 'M'),
            (    'marketing', 'M'),
            (    'marketing', 'F'),
            (         'none', 'M'),
            (         'none', 'F'),
            (        'other', 'M'),
            (        'other', 'F'),
            (   'programmer', 'M'),
            (   'programmer', 'F'),
            (      'retired', 'M'),
            (      'retired', 'F'),
            (     'salesman', 'M'),
            (     'salesman', 'F'),
            (    'scientist', 'M'),
            (    'scientist', 'F'),
            (      'student', 'M'),
            (      'student', 'F'),
            (   'technician', 'M'),
            (   'technician', 'F'),
            (       'writer', 'M'),
            (       'writer', 'F')],
           names=['occupation', 'gender']), Index(['gender'], dtype='object')]

I want to sum the row by grouping gender and occupation, is there any way to do it without using forloop?

CodePudding user response:

You could try the following:

male_ratios = (
    users
    .groupby("occupation")["gender"].value_counts()
    .unstack()
    .assign(male_ratio=lambda df: df["M"] / df.sum(axis=1))
    .sort_values("male_ratio", ascending=False)
)

With .unstack you're pulling the gender part of the multiindex into the columns. Then calculate the ratios and put them into a new column male_ratios, and sort the dataframe in descending order along the new column.

Example dataframe:

from random import choice

occupations = ['administrator', 'artist', 'doctor', 'retired']
genders = ['F', 'M']
users = pd.DataFrame(
    [(n, choice(occupations), choice(genders)) for n in range(100)],
    columns=["id", "occupation", "gender"]
)
    id     occupation gender
0    0        retired      F
1    1  administrator      M
2    2         artist      F
3    3  administrator      F
4    4  administrator      F
..  ..            ...    ...
95  95         artist      M
96  96         doctor      M
97  97        retired      M
98  98  administrator      F
99  99  administrator      F

[100 rows x 3 columns]

Result of the code above:

gender          F   M  male_ratio
occupation                       
doctor          8  13    0.619048
artist         12  13    0.520000
retired        17  16    0.484848
administrator  14   7    0.333333

You could do the same with .pivot_table:

male_ratios = (
    users
    .pivot_table(index="occupation", columns="gender", aggfunc="count")
    .droplevel(0, axis=1)
    .assign(male_ratio=lambda df: df["M"] / df.sum(axis=1))
    .sort_values("male_ratio", ascending=False)
)
  • Related