having a heck of a time getting what I am expecting with pandas multidimensional frames, and specifically when I try to sort them on one column followed by another. I will try to walk through this to best explain what I am after and appreciate the help.
So I have a dataset that I am looking to analyze, and for an example it looks like below in excel:
Data Opportunity Name Geo Map Forecast Category Total
0 source_1 opportunity_1 geo_1 pipeline 100
1 source_1 opportunity_2 geo_1 upside 200
2 source_1 opportunity_3 geo_1 commit 300
3 source_1 opportunity_4 geo_1 won 400
4 source_1 opportunity_5 geo_1 omitted 500
5 source_2 opportunity_6 geo_2 pipeline 600
6 source_2 opportunity_7 geo_2 upside 700
7 source_2 opportunity_8 geo_2 commit 800
8 source_2 opportunity_9 geo_2 won 900
9 source_2 opportunity_10 geo_2 omitted 1000
10 source_3 opportunity_11 geo_3 pipeline 1100
11 source_3 opportunity_12 geo_3 upside 1200
12 source_3 opportunity_13 geo_3 commit 1300
13 source_3 opportunity_14 geo_3 won 1400
14 source_3 opportunity_15 geo_3 omitted 1500
15 source_4 opportunity_16 geo_4 pipeline 1600
16 source_4 opportunity_17 geo_4 upside 1700
17 source_4 opportunity_18 geo_4 commit 1800
18 source_4 opportunity_19 geo_4 won 1900
19 source_4 opportunity_20 geo_4 omitted 2000
I am loading that dataset into pandas by reading it from excel into a dataframe, df:
import pandas as pd
df = pd.read_excel('C:/Users/nick/Downloads/example_data.xlsx')
I want to show the data first indexed on the 'Geo Map' field, and then on the 'Forecast Category' field, but I want to have control over the order in which those values are sorted. In order to make sure I can do this, I create a keymap for both of those fields in the two formats I think I could need:
regions_array = {'geo_2': 1, 'geo_1': 2, 'geo_4': 3, 'geo_3':4}
regions_list = ['geo_2', 'geo_1', 'geo_4', 'geo_3']
categories_array = {'pipeline': 1, 'upside': 2, 'commit': 3, 'won': 4, 'omitted': 5}
categories_list = ['pipeline', 'upside', 'commit', 'won', 'omitted']
When I go to move the data into a pivot table, I am able to achieve the multi-index structure I am after via a pivot with:
regional_numbers = pd.pivot_table(df, index=['Geo Map', 'Forecast Category'],
columns='Data', values='Total', fill_value=0, aggfunc='sum')
regional_numbers
result:
Data source_1 source_2 source_3 source_4
Geo Map Forecast Category
geo_1 commit 300 0 0 0
omitted 500 0 0 0
pipeline 100 0 0 0
upside 200 0 0 0
won 400 0 0 0
geo_2 commit 0 800 0 0
omitted 0 1000 0 0
pipeline 0 600 0 0
upside 0 700 0 0
won 0 900 0 0
geo_3 commit 0 0 1300 0
omitted 0 0 1500 0
pipeline 0 0 1100 0
upside 0 0 1200 0
won 0 0 1400 0
geo_4 commit 0 0 0 1800
omitted 0 0 0 2000
pipeline 0 0 0 1600
upside 0 0 0 1700
won 0 0 0 1900
However, now I need to re-order my indices by first the 'Geo Map' and then by the 'Forecast Category'. I have tried to sort_index, reindex, but the problem is once I sort based on the first everything is fine but the second level breaks the frame.
What I would like my result to look like is based on sorting from my keys for first the 'Geo Map' and then 'Forecast Category':
Data source_1 source_2 source_3 source_4
Geo Map Forecast Category
geo_2 pipeline 0 600 0 0
upside 0 700 0 0
commit 0 800 0 0
won 0 900 0 0
omitted 0 1000 0 0
geo_1 pipeline 100 0 0 0
upside 200 0 0 0
commit 300 0 0 0
won 400 0 0 0
omitted 500 0 0 0
geo_4 pipeline 0 0 0 1600
upside 0 0 0 1700
commit 0 0 0 1800
won 0 0 0 1900
omitted 0 0 0 2000
geo_3 pipeline 0 0 1100 0
upside 0 0 1200 0
commit 0 0 1300 0
won 0 0 1400 0
omitted 0 0 1500 0
Extra Credit
If I could then add total rows into the equation, that would be the ultimate end state I am looking for. I was trying to add margins=True and margin_name in the pivot arguments but then I needed to sort the Total column in the keys which I ideally want to avoid. End state for the extra credit would be:
Data source_1 source_2 source_3 source_4
Geo Map Forecast Category
geo_2 pipeline 0 600 0 0
upside 0 700 0 0
commit 0 800 0 0
won 0 900 0 0
omitted 0 1000 0 0
Total 0 4000 0 0
geo_1 ... ... ... ... ...
Total X X X X
geo_4 ... ... ... ... ...
Total X X X X
geo_3 ... ... ... ... ...
Total X X X X
TOTALS X X X X
It would also be nice to have the totals to the right of all the sources as well on a row level, but having trouble getting that to show correctly.
If the way I am asking for the totals is not possible, I am sure I can figure out totaling the values I want separately but would be nice.
Thanks for any help you can provide!
EDITED to fix my expected output.
CodePudding user response:
Note: categories_array
have keys starting with capital letters, change them to be the same as in your dataframe.
Try:
def add_total(x):
x = pd.concat(
[
x,
pd.DataFrame(
{c: [x[c].sum()] for c in x.columns}, index=[("", "Total")]
),
]
)
return x
regional_numbers = regional_numbers.reindex(
sorted(
regional_numbers.index,
key=lambda k: (regions_array[k[0]], categories_array[k[1]]),
)
)
regional_numbers = (
regional_numbers.groupby(level=0, as_index=False, sort=False)
.apply(add_total)
.droplevel(0)
)
regional_numbers["TOTAL"] = regional_numbers.sum(axis=1)
regional_numbers = pd.concat(
[
regional_numbers,
pd.DataFrame(
{c: [regional_numbers[c].sum()] for c in regional_numbers.columns},
index=[("TOTAL", "")],
),
]
)
print(regional_numbers)
Prints:
source_1 source_2 source_3 source_4 TOTAL
Geo_Map Forecast_Category
geo_2 pipeline 0 600 0 0 600
upside 0 700 0 0 700
commit 0 800 0 0 800
won 0 900 0 0 900
omitted 0 1000 0 0 1000
Total 0 4000 0 0 4000
geo_1 pipeline 100 0 0 0 100
upside 200 0 0 0 200
commit 300 0 0 0 300
won 400 0 0 0 400
omitted 500 0 0 0 500
Total 1500 0 0 0 1500
geo_4 pipeline 0 0 0 1600 1600
upside 0 0 0 1700 1700
commit 0 0 0 1800 1800
won 0 0 0 1900 1900
omitted 0 0 0 2000 2000
Total 0 0 0 9000 9000
geo_3 pipeline 0 0 1100 0 1100
upside 0 0 1200 0 1200
commit 0 0 1300 0 1300
won 0 0 1400 0 1400
omitted 0 0 1500 0 1500
Total 0 0 6500 0 6500
TOTAL 3000 8000 13000 18000 42000
CodePudding user response:
Try using pd.MultiIndex.from_products then reindex:
idx = pd.MultiIndex.from_product([regions_list, categories_list], names=['Geo Map', 'Forecast Category'])
df_out = regional_numbers.reindex(idx)
df_out
Output:
Data source_1 source_2 source_3 source_4
Geo Map Forecast Category
geo_2 pipeline 0 600 0 0
upside 0 700 0 0
commit 0 800 0 0
won 0 900 0 0
omitted 0 1000 0 0
geo_1 pipeline 100 0 0 0
upside 200 0 0 0
commit 300 0 0 0
won 400 0 0 0
omitted 500 0 0 0
geo_4 pipeline 0 0 0 1600
upside 0 0 0 1700
commit 0 0 0 1800
won 0 0 0 1900
omitted 0 0 0 2000
geo_3 pipeline 0 0 1100 0
upside 0 0 1200 0
commit 0 0 1300 0
won 0 0 1400 0
omitted 0 0 1500 0