I’m trying to get from a
to b
. I got a Pandas data frame similar to the a below.
data={'col1':['N1','N1','N2','N2', 'N2','N3'],
'col2':['DE','NO','DE','NO', 'IT','DE'],
'col3':[7, 5, 4, 1, 2, 8],
'col3_sum':[12, 12, 7, 7, 7, 8],
'col4':[0.6, 0.2, 0.7, 0.1, 0.2, 0.6],
'col4_sum':[0.8, 0.8, 1.0, 1.0, 1.0, 0.6],
'col5':[1,2,3,4,5,6]}
a=pd.DataFrame(data)
print(a)
col1 col2 col3 col3_sum col4 col4_sum col5
0 N1 DE 7 12 0.6 0.8 1
1 N1 NO 5 12 0.2 0.8 2
2 N2 DE 4 7 0.7 1.0 3
3 N2 NO 1 7 0.1 1.0 4
4 N2 IT 2 7 0.2 1.0 5
5 N3 DE 8 8 0.6 0.6 6
I realize I’ve backed myself into a corner by computing sums in a flat file. I’m new to Python. I guess I should create the sums when I'm done pivoting?
What I am stuck in is this wrong b
struggle,
b = df.pivot_table(index=['col1'],
values=['col3', 'col3_sum','col4', 'col4_sum'],
columns='col2')
# or
b = pd.pivot_table(a,index=['col1', 'col2'], columns=['col3', 'col4'],
aggfunc=len).reset_index()
# this makes senst to me, but not to python
a.pivot(index=['col1', 'col2'], columns='col2', values=['col3', 'col4'])
# print(b) # where I'm stuck at ...
I would like to get to something like this b
,
print(b) # my goal
col1 var var_sum DE NO IT
N1 col3 12 7 5
N1 col4 0.8 0.6 0.2
N2 col3 7 4 1 2
N2 col4 1.0 0.7 0.1 0.2
N3 col3 8 8
N3 col4 0.6 0.6
I'm not sure what to search for (some of the maybe relevant questions has way too much complexity for my to be able to extract what I need, at least at the moment). I've looked a lot at this answer, maybe I should find a way using .groupby()
CodePudding user response:
Maybe you can compute the sum afterwards:
out = pd.melt(a, ["col1", "col2"], ["col3", "col4"]).pivot(
["col1", "variable"], "col2"
)
out["var_sum"] = out.sum(axis=1)
out = out.reset_index()
out.index.name, out.columns.name = None, None
out.columns = [
f"{a}_{b}".replace("value", "").strip("_") for a, b in out.columns
]
print(out)
Prints:
col1 variable DE IT NO var_sum
0 N1 col3 7.0 NaN 5.0 12.0
1 N1 col4 0.6 NaN 0.2 0.8
2 N2 col3 4.0 2.0 1.0 7.0
3 N2 col4 0.7 0.2 0.1 1.0
4 N3 col3 8.0 NaN NaN 8.0
5 N3 col4 0.6 NaN NaN 0.6
CodePudding user response:
Andrej Kesely did answer my questions. I did however chose a solution that preserved the MultiIndex pandas.core.frame.DataFrame
, but did an additional bit of housekeeping, to get the order of col2
the way I wanted it and the var_sum
as the first column. Like this,
b = pd.melt(a, ["col1", "col2"], ["col3", "col4"]).pivot(
["col1", "variable"], "col2"
)
b = b.reindex(columns=a['col2'].unique(), level='col2')
sums = b.groupby(level=0, axis=1).sum()
sums.columns = pd.MultiIndex.from_product([sums.columns, ['var_sum']])
b = pd.concat([sums, b], axis=1)
b
value
var_sum DE NO IT
col1 variable
N1 col3 12.0 7.0 5.0 NaN
col4 0.8 0.6 0.2 NaN
N2 col3 7.0 4.0 1.0 2.0
col4 1.0 0.7 0.1 0.2
N3 col3 8.0 8.0 NaN NaN
col4 0.6 0.6 NaN NaN