I've scraped from data from excel that has more than one header row of data, with some nans in one row:
df = pd.read_excel(file, header=None)
0 nan nan jan20 jan20 feb20 feb20
1 ID Name GrpA GrpB GrpA GrpB
2 1 One 1 2 3 4
3 2 Two 5 6 7 8
Each ID/Name pair has 4 pieces of data it needs to be associated with. The goal for the data is to look like:
ID Name
1 One jan20 grpA 1
1 One jan20 grpB 2
1 One feb20 grpA 3
1 One feb20 grpB 4
2 Two jan20 grpA 5
2 Two jan20 grpB 6
2 Two feb20 grpA 7
2 Two feb20 grpB 8
Note when this is scraped, I don't indicate the header rows upon import, as there is some cleaning of the data to do. I've tried lots of various combinations of pd.melt, pd.stack, pd.unstack, but not getting there. I've tried making the first two rows both header rows, but then I have problems with the keys.
It seems like .melt might be the right answer because you can choose a col_level and could name the resulting columns, but when I try that route I can't access the dates (first row) it seems
df.melt(col_level=1, id_vars=['ID'])
Any ideas on this one?
CodePudding user response:
Assuming the lines of your sample are all rows of the dataframe (not spreadsheet), this should work. Here I'm creating a 2-level MultiIndex containing the 2 first rows, then dropping those rows, and doing 2 melt
s.
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1]])
df = df.iloc[2:] # trim off the first 2 rows after we put them in the MultiIndex
tmp = df.melt(id_vars=['ID', 'Name'], col_level=1)
tmp['x'] = df.melt(col_level=0).dropna()[0].to_numpy()
tmp = tmp.rename({1: 'y', 'value': 'z'}, axis=1).set_index(['ID', 'Name'])[['x', 'y', 'z']].sort_index()
Output:
>>> tmp
x y z
ID Name
1 One jan20 GrpA 1
One jan20 GrpB 2
One feb20 GrpA 3
One feb20 GrpB 4
2 Two jan20 GrpA 5
Two jan20 GrpB 6
Two feb20 GrpA 7
Two feb20 GrpB 8
CodePudding user response:
Let us load the file correctly
df = pd.read_excel('yourfile.xlsx',
header=[0,1],
index_col=[0,1])
Then you can just do stack
twice
out = df.stack().stack()
CodePudding user response:
A crude way can be to manually reshape:
(df
.set_index([0,1]).T
.set_index([(float('nan'), float('nan')), ('ID', 'Name')])
.rename_axis(index=[None,None], columns=[None,None])
.stack()
.stack()
)
Output:
jan20 GrpA One 1 1
Two 2 5
GrpB One 1 2
Two 2 6
feb20 GrpA One 1 3
Two 2 7
GrpB One 1 4
Two 2 8
dtype: object