Reshaping long to a specific wide format-CodePudding

I have the following pandas dataframe X in long format:

chid.var    id.var  alt.var wei odd cou cla pla
1           1       1       130 6.6 0   5   FALSE
1           1       2       119 224 0   5   FALSE
1           1       3       126 10  0   5   TRUE
2           1       1       130 15  1   4   TRUE
2           1       2       124 6.9 1   4   FALSE
3           1       1       122 39  0   4   FALSE
3           1       2       124 2   0   4   TRUE
4           1       1       121 5.3 0   1   TRUE
5           1       1       112 22  0   1   FALSE
5           1       2       119 7.2 0   1   TRUE
5           1       3       119 5.9 0   1   FALSE
6           1       1       133 1.7 0   1   TRUE
7           1       1       113 72  0   1   FALSE
7           1       2       113 12  0   1   TRUE
8           1       1       121 13  1   4   TRUE
8           1       2       125 6   1   4   FALSE
8           1       3       127 6.3 1   4   FALSE

and I would like to reshape it into the following wide format:

chid.var    id.var  wei.1   wei.2   wei.3   odd.1   odd.2   odd.3   cou cla pla
1           1       130     119     126     6.6     224     10      0   5   3
2           1       130     124     NaN     15      6.9     NaN     1   4   1
3           1       122     124     NaN     39      2       NaN     0   4   2
4           1       121     NaN     NaN     5.3     NaN     NaN     0   1   1
5           1       112     119     119     22      7.2     5.9     0   1   2
6           1       133     NaN     NaN     1.7     NaN     NaN     0   1   1
7           1       113     113     NaN     72      12      NaN     0   1   2
8           1       121     125     127     13      6       6.3     1   4   1

I did a little bit digging and I know I should use the pivot_table function and index on alt.var, unfortunately the alt.var is not 1 dimensional (it varies from size 1 to size 3), and also I would like the column names to be like the above but pivot_table doesn't seem to be able to do that (my actual dataset is much larger and hence I can't just rename the column names one by one)

And I have no idea how to efficiently transform the pla column into the desired one.

Thanks in advance.

CodePudding user response：

There are two parts: pivot and pla that are fairly independent:

out = df.pivot_table(index=['chid.var','id.var', 'cou','cla'], columns='alt.var', 
                     values=['odd','wei'])
out.columns = [f'{x}.{y}' for x,y in out.columns]
out = out.reset_index()

# pla is just the count of pairs `(chid.var, id.var)`
out.merge(df.groupby(['chid.var','id.var'])['pla'].size(), on=['chid.var', 'id.var'])

Output:

   chid.var  id.var  cou  cla  odd.1  odd.2  odd.3  wei.1  wei.2  wei.3  pla
0         1       1    0    5    6.6  224.0   10.0  130.0  119.0  126.0    3
1         2       1    1    4   15.0    6.9    NaN  130.0  124.0    NaN    2
2         3       1    0    4   39.0    2.0    NaN  122.0  124.0    NaN    2
3         4       1    0    1    5.3    NaN    NaN  121.0    NaN    NaN    1
4         5       1    0    1   22.0    7.2    5.9  112.0  119.0  119.0    3
5         6       1    0    1    1.7    NaN    NaN  133.0    NaN    NaN    1
6         7       1    0    1   72.0   12.0    NaN  113.0  113.0    NaN    2
7         8       1    1    4   13.0    6.0    6.3  121.0  125.0  127.0    3

CodePudding user response：

You can also use pivot_wider from pyjanitor:

import janitor

df1 = df.pivot_wider(['chid.var', 'id.var', 'cou', 'cla'], 
         names_from = 'alt.var', values_from = ['wei', 'odd'])

df1['pla'] = df.groupby(['chid.var', 'id.var'])['pla'].apply(lambda x: np.where(x)[0][0]   1)

df1

   chid.var  id.var  cou  cla  wei_1  wei_2  wei_3  odd_1  odd_2  odd_3  pla
0         1       1    0    5  130.0  119.0  126.0    6.6  224.0   10.0  NaN
1         2       1    1    4  130.0  124.0    NaN   15.0    6.9    NaN  3.0
2         3       1    0    4  122.0  124.0    NaN   39.0    2.0    NaN  1.0
3         4       1    0    1  121.0    NaN    NaN    5.3    NaN    NaN  2.0
4         5       1    0    1  112.0  119.0  119.0   22.0    7.2    5.9  1.0
5         6       1    0    1  133.0    NaN    NaN    1.7    NaN    NaN  2.0
6         7       1    0    1  113.0  113.0    NaN   72.0   12.0    NaN  1.0
7         8       1    1    4  121.0  125.0  127.0   13.0    6.0    6.3  2.0

Note that you can always specify the separator. the default is _