I have a data frame looks like this:
{'year1_ID': [Id1, Id2], 'year1_Name':['Name1', 'Name2'], 'year1_Info': ['Info1','Info2'],'year1_AddInfo':['AddInfo1','AddInfo2'],
'year2_ID': [Id3, Id4], 'year2_Name':['Name3', 'Name4'], 'year2_Info': ['Info3','Info4'],'year2_AddInfo':['AddInfo3','AddInfo4'],
'year3_ID': [Id5, Id6], 'year3_Name':['Name5', 'Name6'], 'year3_Info': ['Info5','Info6'],'year3_AddInfo':['AddInfo5','AddInfo6']}
year1 year1 year1 year1 year2 year2 year2 year2 year3 year3 year3 year3
ID Name Info AddInfo Id Name Info AddInfo Id Name Info AddInfo
Id1 Name1 Info1 AddInfo1 Id3 Name3 Info3 AddInfo3 Id5 Name5 Info5 AddInfo5
Id2 Name2 Info2 AddInfo2 Id4 Name4 Info4 AddInfo4 Id6 Name6 Info6 AddInfo6
The ID, Name, Info and AddInfo is basically another headers that repeats across multiple years
I would like to convert this data frame into this:
Year ID Name Info AddInfo
year1 Id1 Name1 Info1 AddInfo1
year1 Id2 Name2 Info2 AddInfo2
year2 Id3 Name3 Info3 AddInfo3
year2 Id4 Name4 Info4 AddInfo4
year3 Id5 Name5 Info5 AddInfo5
year3 Id6 Name6 Info6 AddInfo6
I'm trying pd.melt() and pd.stack() but I haven't able to get it quite right.
CodePudding user response:
Given the provided DataFrame, you can set a MultiIndex and stack
:
(df
.set_axis(pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_', n=1))), axis=1)
.stack(0).rename_axis((None, 'Year'))
.reset_index('Year')
#.reset_index(drop=True) # uncomment to have a clean index
)
output:
Year AddInfo ID Info Name
0 year1 AddInfo1 Id1 Info1 Name1
0 year2 AddInfo3 Id3 Info3 Name3
0 year3 AddInfo5 Id5 Info5 Name5
1 year1 AddInfo2 Id2 Info2 Name2
1 year2 AddInfo4 Id4 Info4 Name4
1 year3 AddInfo6 Id6 Info6 Name6
CodePudding user response:
Here's an easy way:
df = pd.DataFrame(data)
df.columns = df.columns.str.split('_', expand=True)
df.stack(0).sort_index(level=1)
Output:
AddInfo ID Info Name
0 year1 AddInfo1 Id1 Info1 Name1
1 year1 AddInfo2 Id2 Info2 Name2
0 year2 AddInfo3 Id3 Info3 Name3
1 year2 AddInfo4 Id4 Info4 Name4
0 year3 AddInfo5 Id5 Info5 Name5
1 year3 AddInfo6 Id6 Info6 Name6
Removing the 0 and 1 then reset_index gives:
df.stack(0).sort_index(level=1).droplevel(0).reset_index()
Output:
index AddInfo ID Info Name
0 year1 AddInfo1 Id1 Info1 Name1
1 year1 AddInfo2 Id2 Info2 Name2
2 year2 AddInfo3 Id3 Info3 Name3
3 year2 AddInfo4 Id4 Info4 Name4
4 year3 AddInfo5 Id5 Info5 Name5
5 year3 AddInfo6 Id6 Info6 Name6