Pandas, Multi-index - re-index both index levels with keymap and keep structure-CodePudding

having a heck of a time getting what I am expecting with pandas multidimensional frames, and specifically when I try to sort them on one column followed by another. I will try to walk through this to best explain what I am after and appreciate the help.

So I have a dataset that I am looking to analyze, and for an example it looks like below in excel:

        Data Opportunity Name Geo Map Forecast Category  Total
0   source_1    opportunity_1   geo_1          pipeline    100
1   source_1    opportunity_2   geo_1            upside    200
2   source_1    opportunity_3   geo_1            commit    300
3   source_1    opportunity_4   geo_1               won    400
4   source_1    opportunity_5   geo_1           omitted    500
5   source_2    opportunity_6   geo_2          pipeline    600
6   source_2    opportunity_7   geo_2            upside    700
7   source_2    opportunity_8   geo_2            commit    800
8   source_2    opportunity_9   geo_2               won    900
9   source_2   opportunity_10   geo_2           omitted   1000
10  source_3   opportunity_11   geo_3          pipeline   1100
11  source_3   opportunity_12   geo_3            upside   1200
12  source_3   opportunity_13   geo_3            commit   1300
13  source_3   opportunity_14   geo_3               won   1400
14  source_3   opportunity_15   geo_3           omitted   1500
15  source_4   opportunity_16   geo_4          pipeline   1600
16  source_4   opportunity_17   geo_4            upside   1700
17  source_4   opportunity_18   geo_4            commit   1800
18  source_4   opportunity_19   geo_4               won   1900
19  source_4   opportunity_20   geo_4           omitted   2000

I am loading that dataset into pandas by reading it from excel into a dataframe, df:

import pandas as pd

df = pd.read_excel('C:/Users/nick/Downloads/example_data.xlsx')

I want to show the data first indexed on the 'Geo Map' field, and then on the 'Forecast Category' field, but I want to have control over the order in which those values are sorted. In order to make sure I can do this, I create a keymap for both of those fields in the two formats I think I could need:

regions_array = {'geo_2': 1, 'geo_1': 2, 'geo_4': 3, 'geo_3':4}
regions_list = ['geo_2', 'geo_1', 'geo_4', 'geo_3']
categories_array = {'pipeline': 1, 'upside': 2, 'commit': 3, 'won': 4, 'omitted': 5}
categories_list = ['pipeline', 'upside', 'commit', 'won', 'omitted']

When I go to move the data into a pivot table, I am able to achieve the multi-index structure I am after via a pivot with:

regional_numbers = pd.pivot_table(df, index=['Geo Map', 'Forecast Category'],
columns='Data', values='Total', fill_value=0, aggfunc='sum')
regional_numbers

result:

Data                       source_1  source_2  source_3  source_4
Geo Map Forecast Category                                        
geo_1   commit                  300         0         0         0
        omitted                 500         0         0         0
        pipeline                100         0         0         0
        upside                  200         0         0         0
        won                     400         0         0         0
geo_2   commit                    0       800         0         0
        omitted                   0      1000         0         0
        pipeline                  0       600         0         0
        upside                    0       700         0         0
        won                       0       900         0         0
geo_3   commit                    0         0      1300         0
        omitted                   0         0      1500         0
        pipeline                  0         0      1100         0
        upside                    0         0      1200         0
        won                       0         0      1400         0
geo_4   commit                    0         0         0      1800
        omitted                   0         0         0      2000
        pipeline                  0         0         0      1600
        upside                    0         0         0      1700
        won                       0         0         0      1900

However, now I need to re-order my indices by first the 'Geo Map' and then by the 'Forecast Category'. I have tried to sort_index, reindex, but the problem is once I sort based on the first everything is fine but the second level breaks the frame.

What I would like my result to look like is based on sorting from my keys for first the 'Geo Map' and then 'Forecast Category':

Data                       source_1  source_2  source_3  source_4
Geo Map Forecast Category                                        
geo_2   pipeline                  0       600         0         0
        upside                    0       700         0         0
        commit                    0       800         0         0
        won                       0       900         0         0
        omitted                   0      1000         0         0  
geo_1   pipeline                100         0         0         0
        upside                  200         0         0         0
        commit                  300         0         0         0
        won                     400         0         0         0
        omitted                 500         0         0         0
geo_4   pipeline                  0         0         0      1600
        upside                    0         0         0      1700
        commit                    0         0         0      1800
        won                       0         0         0      1900
        omitted                   0         0         0      2000      
geo_3   pipeline                  0         0      1100         0
        upside                    0         0      1200         0
        commit                    0         0      1300         0
        won                       0         0      1400         0
        omitted                   0         0      1500         0

Extra Credit

If I could then add total rows into the equation, that would be the ultimate end state I am looking for. I was trying to add margins=True and margin_name in the pivot arguments but then I needed to sort the Total column in the keys which I ideally want to avoid. End state for the extra credit would be:

Data                       source_1  source_2  source_3  source_4
Geo Map Forecast Category                                        
geo_2   pipeline                  0       600         0         0
        upside                    0       700         0         0
        commit                    0       800         0         0
        won                       0       900         0         0
        omitted                   0      1000         0         0
        Total                     0      4000         0         0  

geo_1   ...                     ...       ...       ...       ...
        Total                     X         X         X         X

geo_4   ...                     ...       ...       ...       ...
        Total                     X         X         X         X

geo_3   ...                     ...       ...       ...       ...
        Total                     X         X         X         X

TOTALS                            X         X         X         X

It would also be nice to have the totals to the right of all the sources as well on a row level, but having trouble getting that to show correctly.

If the way I am asking for the totals is not possible, I am sure I can figure out totaling the values I want separately but would be nice.

Thanks for any help you can provide!

EDITED to fix my expected output.

CodePudding user response：

Note: categories_array have keys starting with capital letters, change them to be the same as in your dataframe.

Try:

def add_total(x):
    x = pd.concat(
        [
            x,
            pd.DataFrame(
                {c: [x[c].sum()] for c in x.columns}, index=[("", "Total")]
            ),
        ]
    )
    return x


regional_numbers = regional_numbers.reindex(
    sorted(
        regional_numbers.index,
        key=lambda k: (regions_array[k[0]], categories_array[k[1]]),
    )
)

regional_numbers = (
    regional_numbers.groupby(level=0, as_index=False, sort=False)
    .apply(add_total)
    .droplevel(0)
)

regional_numbers["TOTAL"] = regional_numbers.sum(axis=1)

regional_numbers = pd.concat(
    [
        regional_numbers,
        pd.DataFrame(
            {c: [regional_numbers[c].sum()] for c in regional_numbers.columns},
            index=[("TOTAL", "")],
        ),
    ]
)

print(regional_numbers)

Prints:

                           source_1  source_2  source_3  source_4  TOTAL
Geo_Map Forecast_Category                                               
geo_2   pipeline                  0       600         0         0    600
        upside                    0       700         0         0    700
        commit                    0       800         0         0    800
        won                       0       900         0         0    900
        omitted                   0      1000         0         0   1000
        Total                     0      4000         0         0   4000
geo_1   pipeline                100         0         0         0    100
        upside                  200         0         0         0    200
        commit                  300         0         0         0    300
        won                     400         0         0         0    400
        omitted                 500         0         0         0    500
        Total                  1500         0         0         0   1500
geo_4   pipeline                  0         0         0      1600   1600
        upside                    0         0         0      1700   1700
        commit                    0         0         0      1800   1800
        won                       0         0         0      1900   1900
        omitted                   0         0         0      2000   2000
        Total                     0         0         0      9000   9000
geo_3   pipeline                  0         0      1100         0   1100
        upside                    0         0      1200         0   1200
        commit                    0         0      1300         0   1300
        won                       0         0      1400         0   1400
        omitted                   0         0      1500         0   1500
        Total                     0         0      6500         0   6500
TOTAL                          3000      8000     13000     18000  42000

CodePudding user response：

Try using pd.MultiIndex.from_products then reindex:

idx = pd.MultiIndex.from_product([regions_list, categories_list], names=['Geo Map', 'Forecast Category'])
df_out = regional_numbers.reindex(idx)
df_out

Output:

Data                       source_1  source_2  source_3  source_4
Geo Map Forecast Category                                        
geo_2   pipeline                  0       600         0         0
        upside                    0       700         0         0
        commit                    0       800         0         0
        won                       0       900         0         0
        omitted                   0      1000         0         0
geo_1   pipeline                100         0         0         0
        upside                  200         0         0         0
        commit                  300         0         0         0
        won                     400         0         0         0
        omitted                 500         0         0         0
geo_4   pipeline                  0         0         0      1600
        upside                    0         0         0      1700
        commit                    0         0         0      1800
        won                       0         0         0      1900
        omitted                   0         0         0      2000
geo_3   pipeline                  0         0      1100         0
        upside                    0         0      1200         0
        commit                    0         0      1300         0
        won                       0         0      1400         0
        omitted                   0         0      1500         0