Home > Mobile >  pandas convert columns with same name but different suffixes to indices and index to columns
pandas convert columns with same name but different suffixes to indices and index to columns

Time:02-14

I have a problem in which I am trying to change the shape of data, but trying to have column names on row level without having suffixes attached to them.

e.g.

My original df looks like this

 ------ -------- -------- -------- -------- -------- -------- 
| Year | col1_a | col2_a | col3_a | col1_b | col2_b | col3_b |
 ------ -------- -------- -------- -------- -------- -------- 
| 2017 |    1.4 |    555 |      5 |    123 |   55.5 |     80 |
| 2018 |    1.5 |    444 |      6 |    456 |   56.5 |     90 |
| 2019 |    0.6 |    333 |      8 |    789 |   57.5 |    100 |
 ------ -------- -------- -------- -------- -------- -------- 

and I am trying to reshape it as

 ------ ------ ------ ------ ------ ------ ------ 
|      |  a   |  a   |   a  |  b   |   b  |   b  |
 ------ ------ ------ ------ ------ ------ ------ 
|        2017 | 2018 | 2019 | 2017 | 2018 | 2019 |
| col1 |  1.4 |  1.5 |  0.6 |  123 |  456 |  789 |
| col2 |  555 |  444 |  333 | 55.5 | 56.5 | 57.5 |
| col3 |    5 |    6 |    8 |   80 |   90 |  100 |
 ------ ------ ------ ------ ------ ------ ------ 

I can use pivot_table but the thing is I am not able to get unique column names at index level. Any idea which method can be used to reshape this?

CodePudding user response:

We can set_index to address any columns which do not have a set pattern of prefix_suffix. We can then use str.rsplit with expand=True and n=1 to create MultiIndex columns. rsplit with n=1 ensures we only split one time on the rightmost underscore. Then stack to convert the new column index level to a row index level and transpose to get to the correct general shape:

df = df.set_index('Year')
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = df.stack(level=1).transpose()
Year   2017          2018          2019       
          a      b      a      b      a      b
col1    1.4  123.0    1.5  456.0    0.6  789.0
col2  555.0   55.5  444.0   56.5  333.0   57.5
col3    5.0   80.0    6.0   90.0    8.0  100.0

We can further clean the data with a swaplevel and sort_index to get the headers to match the expected order. Also rename_axis to remove any unneeded column names:

df = df.set_index('Year')
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = (
    df.stack(level=1)
        .transpose()
        .sort_index(axis=1, level=1)
        .swaplevel(axis=1)
        .rename_axis(columns=[None, None])
)
          a                    b              
       2017   2018   2019   2017   2018   2019
col1    1.4    1.5    0.6  123.0  456.0  789.0
col2  555.0  444.0  333.0   55.5   56.5   57.5
col3    5.0    6.0    8.0   80.0   90.0  100.0

Setup used:

import pandas as pd

df = pd.DataFrame({
    'Year': [2017, 2018, 2019], 'col1_a': [1.4, 1.5, 0.6],
    'col2_a': [555, 444, 333], 'col3_a': [5, 6, 8], 'col1_b': [123, 456, 789],
    'col2_b': [55.5, 56.5, 57.5], 'col3_b': [80, 90, 100]
})

CodePudding user response:

The solution requires converting to long form, before reshaping into a wide form. One option is a combination of pivot_longer from pyjanitor, with pivot:

# pip install pyjanitor
import pandas as pd
import janitor

( df
.pivot_longer(index='Year', names_to=('col', '.value'), names_sep='_')
.pivot(index='col', columns='Year')
.rename_axis(columns=[None, None])
)

          a                    b
       2017   2018   2019   2017   2018   2019
col
col1    1.4    1.5    0.6  123.0  456.0  789.0
col2  555.0  444.0  333.0   55.5   56.5   57.5
col3    5.0    6.0    8.0   80.0   90.0  100.0
  • Related