Home > OS >  Python (Pandas) pivot datframe, some sums keeping the order
Python (Pandas) pivot datframe, some sums keeping the order

Time:09-26

I’m trying to get from a to b. I got a Pandas data frame similar to the a below.

data={'col1':['N1','N1','N2','N2', 'N2','N3'],
        'col2':['DE','NO','DE','NO', 'IT','DE'],
        'col3':[7, 5, 4, 1, 2, 8],
        'col3_sum':[12, 12, 7, 7, 7, 8],
        'col4':[0.6, 0.2, 0.7, 0.1, 0.2, 0.6],
        'col4_sum':[0.8, 0.8, 1.0, 1.0, 1.0, 0.6],
        'col5':[1,2,3,4,5,6]}
a=pd.DataFrame(data)
print(a)
  col1 col2  col3  col3_sum  col4  col4_sum  col5
0   N1   DE     7        12   0.6       0.8     1
1   N1   NO     5        12   0.2       0.8     2
2   N2   DE     4         7   0.7       1.0     3
3   N2   NO     1         7   0.1       1.0     4
4   N2   IT     2         7   0.2       1.0     5
5   N3   DE     8         8   0.6       0.6     6

I realize I’ve backed myself into a corner by computing sums in a flat file. I’m new to Python. I guess I should create the sums when I'm done pivoting?

What I am stuck in is this wrong b struggle,

b = df.pivot_table(index=['col1'], 
                    values=['col3', 'col3_sum','col4', 'col4_sum'],
                    columns='col2')
# or 
b = pd.pivot_table(a,index=['col1', 'col2'], columns=['col3', 'col4'],
                    aggfunc=len).reset_index()
# this makes senst to me, but not to python
a.pivot(index=['col1', 'col2'], columns='col2', values=['col3', 'col4'])
# print(b) # where I'm stuck at ... 

I would like to get to something like this b,

print(b) # my goal
col1    var     var_sum     DE      NO      IT
N1      col3    12          7       5   
N1      col4    0.8         0.6     0.2 
N2      col3    7           4       1       2
N2      col4    1.0         0.7     0.1     0.2
N3      col3    8           8       
N3      col4    0.6         0.6     

I'm not sure what to search for (some of the maybe relevant questions has way too much complexity for my to be able to extract what I need, at least at the moment). I've looked a lot at this answer, maybe I should find a way using .groupby()

CodePudding user response:

Maybe you can compute the sum afterwards:

out = pd.melt(a, ["col1", "col2"], ["col3", "col4"]).pivot(
    ["col1", "variable"], "col2"
)
out["var_sum"] = out.sum(axis=1)

out = out.reset_index()
out.index.name, out.columns.name = None, None
out.columns = [
    f"{a}_{b}".replace("value", "").strip("_") for a, b in out.columns
]

print(out)

Prints:

  col1 variable   DE   IT   NO  var_sum
0   N1     col3  7.0  NaN  5.0     12.0
1   N1     col4  0.6  NaN  0.2      0.8
2   N2     col3  4.0  2.0  1.0      7.0
3   N2     col4  0.7  0.2  0.1      1.0
4   N3     col3  8.0  NaN  NaN      8.0
5   N3     col4  0.6  NaN  NaN      0.6

CodePudding user response:

Andrej Kesely did answer my questions. I did however chose a solution that preserved the MultiIndex pandas.core.frame.DataFrame, but did an additional bit of housekeeping, to get the order of col2 the way I wanted it and the var_sum as the first column. Like this,

b = pd.melt(a, ["col1", "col2"], ["col3", "col4"]).pivot(
    ["col1", "variable"], "col2"
)

b = b.reindex(columns=a['col2'].unique(), level='col2')

sums = b.groupby(level=0, axis=1).sum()
sums.columns = pd.MultiIndex.from_product([sums.columns, ['var_sum']])
b = pd.concat([sums, b], axis=1)
b
                value
              var_sum   DE   NO   IT
col1 variable
N1   col3        12.0  7.0  5.0  NaN
     col4         0.8  0.6  0.2  NaN
N2   col3         7.0  4.0  1.0  2.0
     col4         1.0  0.7  0.1  0.2
N3   col3         8.0  8.0  NaN  NaN
     col4         0.6  0.6  NaN  NaN
  • Related