I have a pandas DataFrame that looks like the following and contains pieces of information that I would like to separate out from column headers and the data rows:
Type1_100_A Type2_200_B Seq
0 34.0 NaN 2_CTGT
1 4573.0 NaN 7_GATG
2 16.0 NaN 4_ACTT
3 NaN 17.0 5_GTCA
4 NaN 25.0 1_TCGA
Code:
pd.DataFrame({
'Type1_3_A': {0: np.nan, 1: np.nan, 2: np.nan, 3: 17.0, 4: 25.0},
'Type2_3_B': {0: 34.0, 1: 4573.0, 2: 16.0, 3: np.nan, 4: np.nan},
'Seq': {
0: '2_CTGT',
1: '7_GATG',
2: '4_ACTT',
3: '5_GTCA',
4: '1_TCGA'
}
})
I would like to rearrange the DataFrame, using substrings of the data and column headers to look like this:
Type Label Replicate Count Seq Number
0 Type1 100 A 2 CTGT 34.0
1 Type1 100 A 7 GATG 4573.0
2 Type1 100 A 4 ACTT 16.0
3 Type2 200 B 5 GTCA 17.0
4 Type2 200 B 1 TCGA 25.0
Code:
pd.DataFrame({
'Type': {0: 'Type1', 1: 'Type1', 2: 'Type1', 3: 'Type2', 4: 'Type2'},
'Label': {0: 100, 1: 100, 2: 100, 3: 200, 4: 200},
'Replicate': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
'Count': {0: 2, 1: 7, 2: 4, 3: 5, 4: 1},
'Seq': {0: 'CTGT', 1: 'GATG', 2: 'ACTT', 3: 'GTCA', 4: 'TCGA'},
'Number': {0: 34.0, 1: 4573.0, 2: 16.0, 3: 17.0, 4: 25.0},
})
CodePudding user response:
Use:
#convert index to Seq
df1 = df.set_index('Seq')
#get sum if only one non NaN per rows
v = df1.sum(axis=1)
#repeat non NaNs by columns names
df1 = df1.notna().dot(df1.columns).reset_index(name='Type')
#split columns
df1[['Count','Seq']] = df1['Seq'].str.split('_', expand=True)
df1[['Type','Label','Replicate']] = df1['Type'].str.split('_', expand=True)
#set new columns
df1['Number'] = v.to_numpy()
#change order of columns
df1 = df1[['Type', 'Label', 'Replicate', 'Count', 'Seq', 'Number']]
print (df1)
Type Label Replicate Count Seq Number
0 Type2 3 B 2 CTGT 34.0
1 Type2 3 B 7 GATG 4573.0
2 Type2 3 B 4 ACTT 16.0
3 Type1 3 A 5 GTCA 17.0
4 Type1 3 A 1 TCGA 25.0