How to melt a dataframe based on a pattern of the column name?-CodePudding

I'm trying to melt a dataframe

Year | Baseline_low | Baseline_high | Overall_low | Overall_high
================================================================    
2022 |            1 |             2 |           3 |            4
2023 |            5 |             6 |           7 |            8

to this:

2022 | Baseline |  low | 1
2022 | Baseline | high | 2

There are a lot more columns that end with _low or _high than I've listed so I would like to use a pattern to split the column.

Is it possible to do this with melt?

I've seen this similar question but it has a slightly different structure to my dataframe.

CodePudding user response：

If you aim to be pure pandas, then yes melt is they key:

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""Year | Baseline_low | Baseline_high | Overall_low | Overall_high  
2022 |            1 |             2 |           3 |            4
2023 |            5 |             6 |           7 |            8"""), sep="|")
df.columns = [column.strip() for column in df.columns]

final = df.melt(id_vars=["Year"])
final["Type"] = final["variable"].apply(lambda x: x.split("_")[0])
final["State"] = final["variable"].apply(lambda x: x.split("_")[1])

Then just split information you need based on column names, it returns:

   Year       variable  value      Type State
0  2022   Baseline_low      1  Baseline   low
1  2023   Baseline_low      5  Baseline   low
2  2022  Baseline_high      2  Baseline  high
3  2023  Baseline_high      6  Baseline  high
4  2022    Overall_low      3   Overall   low
5  2023    Overall_low      7   Overall   low
6  2022   Overall_high      4   Overall  high
7  2023   Overall_high      8   Overall  high

CodePudding user response：

You cannot achieve this with melt only. You need post-processing or to use a reshaping with a MultiIndex.

It's however a good use case for janitor's pivot_longer:

import janitor

df.pivot_longer(
    index='Year', 
    names_to=('.value', 'new'), 
    names_sep='_', 
    sort_by_appearance=True
)

output:

   Year   new  Baseline  Overall
0  2022   low         1        3
1  2022  high         2        4
2  2023   low         5        7
3  2023  high         6        8

CodePudding user response：

#melt
df2=df.melt('Year')

# split the columns
df2[['col','level']]=df2['variable'].str.split('_', expand=True)

# pivot
df2.pivot(index=['Year','level'], columns='col', values='value').reset_index().rename_axis(columns=None) 

#rename_axis, thanks to @mozway!

    Year    level   Baseline    Overall
0   2022    high           2    4
1   2022    low            1    3
2   2023    high           6    8
3   2023    low            5    7