Home > Blockchain >  Calculating difference in multi index pandas data frame
Calculating difference in multi index pandas data frame

Time:12-24

I have a df:

df_test = sns.load_dataset("flights")
df_test['cat_2'] = np.random.choice(range(10), df_test.shape[0])
df_test.pivot_table(index='month', 
                      columns='year', 
                      values=['passengers', 'cat_2'])\
                          .swaplevel(0,1, axis=1)\
                          .sort_index(axis=1, level=0)\
                          .fillna(0)

I am trying to calculate the difference between cat_2 and passengers each year compared to the year before.

What is the best way to achieve this? Desired output would look similar to this:

year    1949                            1950                                1951    
        cat_2   passengers    % diff    cat_2   passengers   % diff         cat_2   passengers     % diff
month                                                                                   
Jan     6       112           0         6       115          115/112        6       90                90/115
Feb     0       118           0         6       126          126/118        6       150               150 / 126  
Mar     2       132           0         7       141                         7       141
Apr     0       129           0         9       135                         9       135
May     5       121           0         4       125                         4       125
Jun     1       135           0         3       149                         3       149
Jul     6       148           0         5       170                         5       170
Aug     5       148           0         2       170                         2       170
Sep     1       136           0         4       158                         4       158
Oct     5       119           0         5       133                         5       133
Nov     0       104           0         1       114                         1       114
Dec     7       118           0         1       140                         1       140

I only showed the desired calculations for columns passengers but the same calculations method can be used for cat_2 as well. As there is nothing to compare the first year I filled the values with 0.

CodePudding user response:

You can select columns passengers by DataFrame.xs, divide by shifted rows by DataFrame.shift, create MultiIndex and append to original DataFrame:

df_test = sns.load_dataset("flights")
df_test['cat_2'] = np.random.choice(range(10), df_test.shape[0])
df = df_test.pivot_table(index='month', 
                      columns='year', 
                      values=['passengers', 'cat_2'])\
                          .swaplevel(0,1, axis=1)\
                          .sort_index(axis=1, level=0)\
                          .fillna(0)
                          
                          
df1 = df.xs('passengers', axis=1, level=1)

df2 = pd.concat({'% diff': df1.div(df1.shift(axis=1)).fillna(0)}, axis=1)

out = (pd.concat([df, df2.swaplevel(axis=1)], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))

Another idea is shift values by add 1 year, only necessary remove last NaN column by iloc:

df1 = df.xs('passengers', axis=1, level=1)

df2 = (pd.concat({'% diff': df1.div(df1.rename(columns=lambda x: x 1))}, axis=1)
         .fillna(0)
         .iloc[:, :-1])

out = (pd.concat([df, df2.swaplevel(axis=1)], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))
  • Related