I am trying the take values from a data frame and make it into another. It is hard to explain what I am doing but I have made an example below. Please can someone help as I have lots of columns of I would like to reduce to a few. I want to end up with matrix pd.concat([df1,df2])
. from df
.
Pre
is a factor with 2 levels, 0
forPOST
or 1
for PRE
, SPC
is a factor with many levels.
Thank you
df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
'POST_SPC2': {0:10, 1:50, 2:80},
'POST_SPC3': {0:30, 1:40, 2:10}})
df1 = pd.DataFrame({'CPDID':{0: 'C1', 1: 'C2', 2: 'C3'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE': {0: 1, 1: 1, 2: 1},
'SPC': {0:1, 1:1, 2:1},
'Damage': {0:'NaN', 1:'NaN', 2:'NaN'}})
df2 = pd.DataFrame({'CPDID':{0: 'C1', 1: 'C2', 2: 'C2'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE': {0: 0, 1: 0, 2: 0},
'SPC': {0:2, 1:2, 2:2},
'Damage': {0:10, 1:50, 2:80}})
print(df)
print(pd.concat([df1,df2]))
print(df)
print(pd.concat([df1,df2]))
CodePudding user response:
The core step is to transform the dataframe by .stack()
. However, your desired dataframe requires quite a few steps to transform and extract column label values from the base df
, as follows:
df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
'POST_SPC2': {0:10, 1:50, 2:80},
'POST_SPC3': {0:30, 1:40, 2:10}})
df_out = df.set_index(['CPDID', 'Rate'])
# split 'PRE'/'POST' from 'SPCn' from column labels
df_out.columns = df_out.columns.str.split('_', expand=True)
# prepare for column name `PRE', 'SPC' for the related columns
df_out = df_out.rename_axis(('PRE', 'SPC'), axis=1)
# main step to transform df by stacking and name the values as 'Damage'
df_out = df_out.stack(level=[0,1]).reset_index(name='Damage')
# transform the values of 'PRE'
df_out['PRE'] = df_out['PRE'].eq('PRE').astype(int)
# extract number from 'SPCn'
df_out['SPC'] = df_out['SPC'].str.extract(r'(\d)$')
# sort to the required sequence
df_out = df_out.sort_values('SPC', ignore_index=True)
Result:
print(df_out)
CPDID Rate PRE SPC Damage
0 C1 100 1 1 NaN
1 C2 500 1 1 NaN
2 C3 200 1 1 NaN
3 C1 100 0 2 10.0
4 C2 500 0 2 50.0
5 C3 200 0 2 80.0
6 C1 100 0 3 30.0
7 C2 500 0 3 40.0
8 C3 200 0 3 10.0