I have a problem in which I am trying to change the shape of data, but trying to have column names on row level without having suffixes attached to them.
e.g.
My original df looks like this
------ -------- -------- -------- -------- -------- --------
| Year | col1_a | col2_a | col3_a | col1_b | col2_b | col3_b |
------ -------- -------- -------- -------- -------- --------
| 2017 | 1.4 | 555 | 5 | 123 | 55.5 | 80 |
| 2018 | 1.5 | 444 | 6 | 456 | 56.5 | 90 |
| 2019 | 0.6 | 333 | 8 | 789 | 57.5 | 100 |
------ -------- -------- -------- -------- -------- --------
and I am trying to reshape it as
------ ------ ------ ------ ------ ------ ------
| | a | a | a | b | b | b |
------ ------ ------ ------ ------ ------ ------
| 2017 | 2018 | 2019 | 2017 | 2018 | 2019 |
| col1 | 1.4 | 1.5 | 0.6 | 123 | 456 | 789 |
| col2 | 555 | 444 | 333 | 55.5 | 56.5 | 57.5 |
| col3 | 5 | 6 | 8 | 80 | 90 | 100 |
------ ------ ------ ------ ------ ------ ------
I can use pivot_table but the thing is I am not able to get unique column names at index level. Any idea which method can be used to reshape this?
CodePudding user response:
We can set_index to address any columns which do not have a set pattern of prefix_suffix
. We can then use str.rsplit with expand=True
and n=1
to create MultiIndex columns. rsplit
with n=1
ensures we only split one time on the rightmost underscore. Then stack to convert the new column index level to a row index level and transpose to get to the correct general shape:
df = df.set_index('Year')
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = df.stack(level=1).transpose()
Year 2017 2018 2019
a b a b a b
col1 1.4 123.0 1.5 456.0 0.6 789.0
col2 555.0 55.5 444.0 56.5 333.0 57.5
col3 5.0 80.0 6.0 90.0 8.0 100.0
We can further clean the data with a swaplevel and sort_index to get the headers to match the expected order. Also rename_axis to remove any unneeded column names:
df = df.set_index('Year')
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = (
df.stack(level=1)
.transpose()
.sort_index(axis=1, level=1)
.swaplevel(axis=1)
.rename_axis(columns=[None, None])
)
a b
2017 2018 2019 2017 2018 2019
col1 1.4 1.5 0.6 123.0 456.0 789.0
col2 555.0 444.0 333.0 55.5 56.5 57.5
col3 5.0 6.0 8.0 80.0 90.0 100.0
Setup used:
import pandas as pd
df = pd.DataFrame({
'Year': [2017, 2018, 2019], 'col1_a': [1.4, 1.5, 0.6],
'col2_a': [555, 444, 333], 'col3_a': [5, 6, 8], 'col1_b': [123, 456, 789],
'col2_b': [55.5, 56.5, 57.5], 'col3_b': [80, 90, 100]
})
CodePudding user response:
The solution requires converting to long form, before reshaping into a wide form. One option is a combination of pivot_longer
from pyjanitor
, with pivot
:
# pip install pyjanitor
import pandas as pd
import janitor
( df
.pivot_longer(index='Year', names_to=('col', '.value'), names_sep='_')
.pivot(index='col', columns='Year')
.rename_axis(columns=[None, None])
)
a b
2017 2018 2019 2017 2018 2019
col
col1 1.4 1.5 0.6 123.0 456.0 789.0
col2 555.0 444.0 333.0 55.5 56.5 57.5
col3 5.0 6.0 8.0 80.0 90.0 100.0