I have the following pandas dataframe X
in long format:
chid.var id.var alt.var wei odd cou cla pla
1 1 1 130 6.6 0 5 FALSE
1 1 2 119 224 0 5 FALSE
1 1 3 126 10 0 5 TRUE
2 1 1 130 15 1 4 TRUE
2 1 2 124 6.9 1 4 FALSE
3 1 1 122 39 0 4 FALSE
3 1 2 124 2 0 4 TRUE
4 1 1 121 5.3 0 1 TRUE
5 1 1 112 22 0 1 FALSE
5 1 2 119 7.2 0 1 TRUE
5 1 3 119 5.9 0 1 FALSE
6 1 1 133 1.7 0 1 TRUE
7 1 1 113 72 0 1 FALSE
7 1 2 113 12 0 1 TRUE
8 1 1 121 13 1 4 TRUE
8 1 2 125 6 1 4 FALSE
8 1 3 127 6.3 1 4 FALSE
and I would like to reshape it into the following wide format:
chid.var id.var wei.1 wei.2 wei.3 odd.1 odd.2 odd.3 cou cla pla
1 1 130 119 126 6.6 224 10 0 5 3
2 1 130 124 NaN 15 6.9 NaN 1 4 1
3 1 122 124 NaN 39 2 NaN 0 4 2
4 1 121 NaN NaN 5.3 NaN NaN 0 1 1
5 1 112 119 119 22 7.2 5.9 0 1 2
6 1 133 NaN NaN 1.7 NaN NaN 0 1 1
7 1 113 113 NaN 72 12 NaN 0 1 2
8 1 121 125 127 13 6 6.3 1 4 1
I did a little bit digging and I know I should use the pivot_table
function and index on alt.var
, unfortunately the alt.var
is not 1 dimensional (it varies from size 1 to size 3), and also I would like the column names to be like the above but pivot_table
doesn't seem to be able to do that (my actual dataset is much larger and hence I can't just rename the column names one by one)
And I have no idea how to efficiently transform the pla
column into the desired one.
Thanks in advance.
CodePudding user response:
There are two parts: pivot
and pla
that are fairly independent:
out = df.pivot_table(index=['chid.var','id.var', 'cou','cla'], columns='alt.var',
values=['odd','wei'])
out.columns = [f'{x}.{y}' for x,y in out.columns]
out = out.reset_index()
# pla is just the count of pairs `(chid.var, id.var)`
out.merge(df.groupby(['chid.var','id.var'])['pla'].size(), on=['chid.var', 'id.var'])
Output:
chid.var id.var cou cla odd.1 odd.2 odd.3 wei.1 wei.2 wei.3 pla
0 1 1 0 5 6.6 224.0 10.0 130.0 119.0 126.0 3
1 2 1 1 4 15.0 6.9 NaN 130.0 124.0 NaN 2
2 3 1 0 4 39.0 2.0 NaN 122.0 124.0 NaN 2
3 4 1 0 1 5.3 NaN NaN 121.0 NaN NaN 1
4 5 1 0 1 22.0 7.2 5.9 112.0 119.0 119.0 3
5 6 1 0 1 1.7 NaN NaN 133.0 NaN NaN 1
6 7 1 0 1 72.0 12.0 NaN 113.0 113.0 NaN 2
7 8 1 1 4 13.0 6.0 6.3 121.0 125.0 127.0 3
CodePudding user response:
You can also use pivot_wider
from pyjanitor
:
import janitor
df1 = df.pivot_wider(['chid.var', 'id.var', 'cou', 'cla'],
names_from = 'alt.var', values_from = ['wei', 'odd'])
df1['pla'] = df.groupby(['chid.var', 'id.var'])['pla'].apply(lambda x: np.where(x)[0][0] 1)
df1
chid.var id.var cou cla wei_1 wei_2 wei_3 odd_1 odd_2 odd_3 pla
0 1 1 0 5 130.0 119.0 126.0 6.6 224.0 10.0 NaN
1 2 1 1 4 130.0 124.0 NaN 15.0 6.9 NaN 3.0
2 3 1 0 4 122.0 124.0 NaN 39.0 2.0 NaN 1.0
3 4 1 0 1 121.0 NaN NaN 5.3 NaN NaN 2.0
4 5 1 0 1 112.0 119.0 119.0 22.0 7.2 5.9 1.0
5 6 1 0 1 133.0 NaN NaN 1.7 NaN NaN 2.0
6 7 1 0 1 113.0 113.0 NaN 72.0 12.0 NaN 1.0
7 8 1 1 4 121.0 125.0 127.0 13.0 6.0 6.3 2.0
Note that you can always specify the separator. the default is _