How to manipulate this data frame in pandas python-CodePudding

I am trying the take values from a data frame and make it into another. It is hard to explain what I am doing but I have made an example below. Please can someone help as I have lots of columns of I would like to reduce to a few. I want to end up with matrix pd.concat([df1,df2]). from df. Pre is a factor with 2 levels, 0forPOST or 1 for PRE, SPC is a factor with many levels. Thank you

df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
                   'Rate': {0: 100, 1: 500, 2: 200},
                   'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
                   'POST_SPC2': {0:10, 1:50, 2:80},
                   'POST_SPC3': {0:30, 1:40, 2:10}})


df1 = pd.DataFrame({'CPDID':{0: 'C1', 1: 'C2', 2: 'C3'},
                    'Rate': {0: 100, 1: 500, 2: 200},
                    'PRE': {0: 1, 1: 1, 2: 1},
                    'SPC': {0:1, 1:1, 2:1},
                    'Damage': {0:'NaN', 1:'NaN', 2:'NaN'}})

df2 = pd.DataFrame({'CPDID':{0: 'C1', 1: 'C2', 2: 'C2'},
                    'Rate': {0: 100, 1: 500, 2: 200},
                    'PRE': {0: 0, 1: 0, 2: 0},
                    'SPC': {0:2, 1:2, 2:2},
                    'Damage': {0:10, 1:50, 2:80}})

print(df)
print(pd.concat([df1,df2]))

print(df)
print(pd.concat([df1,df2]))

CodePudding user response：

The core step is to transform the dataframe by .stack(). However, your desired dataframe requires quite a few steps to transform and extract column label values from the base df, as follows:

df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
                   'Rate': {0: 100, 1: 500, 2: 200},
                   'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
                   'POST_SPC2': {0:10, 1:50, 2:80},
                   'POST_SPC3': {0:30, 1:40, 2:10}})

df_out = df.set_index(['CPDID', 'Rate'])

# split 'PRE'/'POST' from 'SPCn' from column labels
df_out.columns = df_out.columns.str.split('_', expand=True)

# prepare for column name `PRE', 'SPC' for the related columns
df_out = df_out.rename_axis(('PRE', 'SPC'),  axis=1)

# main step to transform df by stacking and name the values as 'Damage'
df_out = df_out.stack(level=[0,1]).reset_index(name='Damage')

# transform the values of 'PRE'
df_out['PRE'] = df_out['PRE'].eq('PRE').astype(int)

# extract number from 'SPCn'
df_out['SPC'] = df_out['SPC'].str.extract(r'(\d)$')

# sort to the required sequence
df_out = df_out.sort_values('SPC', ignore_index=True)

Result:

print(df_out)

  CPDID  Rate  PRE SPC Damage
0    C1   100    1   1    NaN
1    C2   500    1   1    NaN
2    C3   200    1   1    NaN
3    C1   100    0   2   10.0
4    C2   500    0   2   50.0
5    C3   200    0   2   80.0
6    C1   100    0   3   30.0
7    C2   500    0   3   40.0
8    C3   200    0   3   10.0