pandas convert columns with same name but different suffixes to indices and index to columns-CodePudding

I have a problem in which I am trying to change the shape of data, but trying to have column names on row level without having suffixes attached to them.

e.g.

My original df looks like this

 ------ -------- -------- -------- -------- -------- -------- 
| Year | col1_a | col2_a | col3_a | col1_b | col2_b | col3_b |
 ------ -------- -------- -------- -------- -------- -------- 
| 2017 |    1.4 |    555 |      5 |    123 |   55.5 |     80 |
| 2018 |    1.5 |    444 |      6 |    456 |   56.5 |     90 |
| 2019 |    0.6 |    333 |      8 |    789 |   57.5 |    100 |
 ------ -------- -------- -------- -------- -------- --------

and I am trying to reshape it as

 ------ ------ ------ ------ ------ ------ ------ 
|      |  a   |  a   |   a  |  b   |   b  |   b  |
 ------ ------ ------ ------ ------ ------ ------ 
|        2017 | 2018 | 2019 | 2017 | 2018 | 2019 |
| col1 |  1.4 |  1.5 |  0.6 |  123 |  456 |  789 |
| col2 |  555 |  444 |  333 | 55.5 | 56.5 | 57.5 |
| col3 |    5 |    6 |    8 |   80 |   90 |  100 |
 ------ ------ ------ ------ ------ ------ ------

I can use pivot_table but the thing is I am not able to get unique column names at index level. Any idea which method can be used to reshape this?

CodePudding user response：

We can set_index to address any columns which do not have a set pattern of prefix_suffix. We can then use str.rsplit with expand=True and n=1 to create MultiIndex columns. rsplit with n=1 ensures we only split one time on the rightmost underscore. Then stack to convert the new column index level to a row index level and transpose to get to the correct general shape:

df = df.set_index('Year')
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = df.stack(level=1).transpose()

Year   2017          2018          2019       
          a      b      a      b      a      b
col1    1.4  123.0    1.5  456.0    0.6  789.0
col2  555.0   55.5  444.0   56.5  333.0   57.5
col3    5.0   80.0    6.0   90.0    8.0  100.0

We can further clean the data with a swaplevel and sort_index to get the headers to match the expected order. Also rename_axis to remove any unneeded column names:

df = df.set_index('Year')
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = (
    df.stack(level=1)
        .transpose()
        .sort_index(axis=1, level=1)
        .swaplevel(axis=1)
        .rename_axis(columns=[None, None])
)

          a                    b              
       2017   2018   2019   2017   2018   2019
col1    1.4    1.5    0.6  123.0  456.0  789.0
col2  555.0  444.0  333.0   55.5   56.5   57.5
col3    5.0    6.0    8.0   80.0   90.0  100.0

Setup used:

import pandas as pd

df = pd.DataFrame({
    'Year': [2017, 2018, 2019], 'col1_a': [1.4, 1.5, 0.6],
    'col2_a': [555, 444, 333], 'col3_a': [5, 6, 8], 'col1_b': [123, 456, 789],
    'col2_b': [55.5, 56.5, 57.5], 'col3_b': [80, 90, 100]
})

CodePudding user response：

The solution requires converting to long form, before reshaping into a wide form. One option is a combination of pivot_longer from pyjanitor, with pivot:

# pip install pyjanitor
import pandas as pd
import janitor

( df
.pivot_longer(index='Year', names_to=('col', '.value'), names_sep='_')
.pivot(index='col', columns='Year')
.rename_axis(columns=[None, None])
)

          a                    b
       2017   2018   2019   2017   2018   2019
col
col1    1.4    1.5    0.6  123.0  456.0  789.0
col2  555.0  444.0  333.0   55.5   56.5   57.5
col3    5.0    6.0    8.0   80.0   90.0  100.0