Home > OS >  Melt/Stack/Reshape multi header dataframe
Melt/Stack/Reshape multi header dataframe

Time:03-17

I've scraped from data from excel that has more than one header row of data, with some nans in one row:

df = pd.read_excel(file, header=None)

0  nan  nan   jan20  jan20  feb20  feb20
1  ID   Name   GrpA   GrpB   GrpA   GrpB
2  1    One     1       2      3     4
3  2    Two     5       6      7     8

Each ID/Name pair has 4 pieces of data it needs to be associated with. The goal for the data is to look like:

ID Name
1   One jan20 grpA 1
1   One jan20 grpB 2
1   One feb20 grpA 3
1   One feb20 grpB 4
2   Two jan20 grpA 5
2   Two jan20 grpB 6 
2   Two feb20 grpA 7
2   Two feb20 grpB 8

Note when this is scraped, I don't indicate the header rows upon import, as there is some cleaning of the data to do. I've tried lots of various combinations of pd.melt, pd.stack, pd.unstack, but not getting there. I've tried making the first two rows both header rows, but then I have problems with the keys.

It seems like .melt might be the right answer because you can choose a col_level and could name the resulting columns, but when I try that route I can't access the dates (first row) it seems

df.melt(col_level=1, id_vars=['ID'])

Any ideas on this one?

CodePudding user response:

Assuming the lines of your sample are all rows of the dataframe (not spreadsheet), this should work. Here I'm creating a 2-level MultiIndex containing the 2 first rows, then dropping those rows, and doing 2 melts.

df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1]])
df = df.iloc[2:] # trim off the first 2 rows after we put them in the MultiIndex

tmp = df.melt(id_vars=['ID', 'Name'], col_level=1)
tmp['x'] = df.melt(col_level=0).dropna()[0].to_numpy()
tmp = tmp.rename({1: 'y', 'value': 'z'}, axis=1).set_index(['ID', 'Name'])[['x', 'y', 'z']].sort_index()

Output:

>>> tmp
             x     y  z
ID Name                
1  One   jan20  GrpA  1
   One   jan20  GrpB  2
   One   feb20  GrpA  3
   One   feb20  GrpB  4
2  Two   jan20  GrpA  5
   Two   jan20  GrpB  6
   Two   feb20  GrpA  7
   Two   feb20  GrpB  8

CodePudding user response:

Let us load the file correctly

df = pd.read_excel('yourfile.xlsx', 
                   header=[0,1], 
                   index_col=[0,1])

Then you can just do stack twice

out = df.stack().stack()

CodePudding user response:

A crude way can be to manually reshape:

(df
 .set_index([0,1]).T
 .set_index([(float('nan'), float('nan')), ('ID', 'Name')])
 .rename_axis(index=[None,None], columns=[None,None])
 .stack()
 .stack()
)

Output:

jan20  GrpA  One  1    1
             Two  2    5
       GrpB  One  1    2
             Two  2    6
feb20  GrpA  One  1    3
             Two  2    7
       GrpB  One  1    4
             Two  2    8
dtype: object
  • Related