Home > database >  groupby in pandas with custom function over a subset of rows in each group
groupby in pandas with custom function over a subset of rows in each group

Time:11-25

I have a pandas DataFrame of the following format:

Input:

                                     X    [OTHER_COLUMNS]
version     branch                      
v0          overall  2475.0         -1         .
v1          overall  2475.0         -1         .
            A        1712.5          1         .
            B         257.5          2         .
            C         392.5          2
            D         112.5          3
v2          overall  2475.0         -1
            A        2341.5          1
            B          95.0          2
            C          38.5          2
v3          overall  2475.0         -1
            A        2000.0          1
            B         475.0          2
v4          overall  2475.0         -1
            A        2341.5          1
            B         133.5          1

where (version, branch) is a MultiIndex.

PROBLEM DESCRIPTION:

I want to groupby version and set the values in the column X with branch overall to the sum of the values in the column X for the remaining branches (having the same version), weighted by the values in the column N. For groups (i.e. versions) which have only one branch (named overall), I want X to be set to 1.

EXAMPLE:

For version v2, the value in the cell with column X and branch overall should be

(2341.5 * 1 95.0 * 2 38.5 * 2) / 2475.0 = 1.05393939394,

and in pseudo-code:

(A_N * A_X B_N * B_X) / overall_N.

Note: For a given version, the value in column N and branch overall will always be equal to the sum of the values in column N for the other branch'es.

IDEA AND QUESTION:

I think I have to do the following:

df.loc[pd.IndexSlice[:, 'overall'], 'X'] = df.groupby('version').apply(...)

where df is the DataFrame and where ... is to be replaced by a custom function.

I am looking for help in constructing such a function.

Expected output:

                          N          X
version     branch                      
v0          overall  2475.0          1         
v1          overall  2475.0          1.35353535354
            A        1712.5          1
            B         257.5          2
            C         392.5          2
            D         112.5          3
v2          overall  2475.0          1.05393939394
            A        2341.5          1
            B          95.0          2
            C          38.5          2
v3          overall  2475.0          1.19191919192
            A        2000.0          1
            B         475.0          2
v4          overall  2475.0          1
            A        2341.5          1
            B         133.5          1

Explaination of expected output:

(1712.5 * 1   257.5 * 2   392.5 * 2   112.5 * 3) / 2475.0 = 1.35353535354

(2341.5 * 1   95.0 * 2   38.5 * 2) / 2475.0 = 1.05393939394

(2000.0 * 1   475.0 * 2) / 2475.0 = 1.19191919192

(2341.5 * 1   133.5 * 1) / 2475.0 = 1

CODE TO CREATE DATAFRAME:

import numpy as np
import pandas as pd
    
df = pd.DataFrame(
    data=np.array(
        [
            [2475.0, 2475.0, 1712.5, 257.5, 392.5, 112.5, 2475.0, 2341.5, 95.0, 38.5, 2475.0, 2000.0, 475.0, 2475.0, 2341.5, 133.5],
            [-1, -1, 1, 2, 2, 3, -1, 1, 2, 2, -1, 1, 2, -1, 1, 1]
        ]
    ).T,
    index=pd.MultiIndex.from_tuples(
        tuples=[
            ('v0', 'overall'),
            ('v1', 'overall'),
            ('v1', 'A'),
            ('v1', 'B'),
            ('v1', 'C'),
            ('v1', 'D'),
            ('v2', 'overall'),
            ('v2', 'A'),
            ('v2', 'B'),
            ('v2', 'C'),
            ('v3', 'overall'),
            ('v3', 'A'),
            ('v3', 'B'),
            ('v4', 'overall'),
            ('v4', 'A'),
            ('v4', 'B'),
        ],
        names=['version', 'branch'],
    ),
    columns=['N', 'X'],
)

print (df)
                      N    X
version branch              
v0      overall  2475.0 -1.0
v1      overall  2475.0 -1.0
        A        1712.5  1.0
        B         257.5  2.0
        C         392.5  2.0
        D         112.5  3.0
v2      overall  2475.0 -1.0
        A        2341.5  1.0
        B          95.0  2.0
        C          38.5  2.0
v3      overall  2475.0 -1.0
        A        2000.0  1.0
        B         475.0  2.0
v4      overall  2475.0 -1.0
        A        2341.5  1.0
        B         133.5  1.0

CodePudding user response:

Use:

#select overalls only
overall = df['N'].xs('overall', level=1)
#select all rows without overalls
df1 = df.drop('overall', level=1)

#multiple and aggregate sum, divide overalls   
s = df1['N'].mul(df1['X']).groupby(level=0).sum().div(overall)

#create MultiIndex and assign back
df.loc[pd.IndexSlice[:, 'overall'], 'X'] = pd.concat({'overall':s}).swaplevel(0,1)

print (df)
                      N         X
version branch                   
v1      overall  2475.0  1.353535
        A        1712.5  1.000000
        B         257.5  2.000000
        C         392.5  2.000000
        D         112.5  3.000000
v2      overall  2475.0  1.053939
        A        2341.5  1.000000
        B          95.0  2.000000
        C          38.5  2.000000
v3      overall  2475.0  1.191919
        A        2000.0  1.000000
        B         475.0  2.000000
v4      overall  2475.0  1.000000
        A        2341.5  1.000000
        B         133.5  1.000000
  • Related