Reformat Pandas DataFrame Using Substring Information-CodePudding

I have a pandas DataFrame that looks like the following and contains pieces of information that I would like to separate out from column headers and the data rows:

    Type1_100_A Type2_200_B Seq
0   34.0        NaN         2_CTGT
1   4573.0      NaN         7_GATG
2   16.0        NaN         4_ACTT
3   NaN         17.0        5_GTCA
4   NaN         25.0        1_TCGA

Code:

pd.DataFrame({
'Type1_3_A': {0: np.nan, 1: np.nan, 2: np.nan, 3: 17.0, 4: 25.0},
'Type2_3_B': {0: 34.0, 1: 4573.0, 2: 16.0, 3: np.nan, 4: np.nan},
'Seq': {
0: '2_CTGT',
1: '7_GATG',
2: '4_ACTT',
3: '5_GTCA',
4: '1_TCGA'
}
})

I would like to rearrange the DataFrame, using substrings of the data and column headers to look like this:

    Type    Label   Replicate   Count   Seq     Number
0   Type1   100     A           2       CTGT    34.0
1   Type1   100     A           7       GATG    4573.0
2   Type1   100     A           4       ACTT    16.0
3   Type2   200     B           5       GTCA    17.0
4   Type2   200     B           1       TCGA    25.0

Code:

pd.DataFrame({
    'Type': {0: 'Type1', 1: 'Type1', 2: 'Type1', 3: 'Type2', 4: 'Type2'},
    'Label': {0: 100, 1: 100, 2: 100, 3: 200, 4: 200},
    'Replicate': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
    'Count': {0: 2, 1: 7, 2: 4, 3: 5, 4: 1},
    'Seq': {0: 'CTGT', 1: 'GATG', 2: 'ACTT', 3: 'GTCA', 4: 'TCGA'},
    'Number': {0: 34.0, 1: 4573.0, 2: 16.0, 3: 17.0, 4: 25.0},
    })

CodePudding user response：

Use:

#convert index to Seq
df1 = df.set_index('Seq')
#get sum if only one non NaN per rows
v = df1.sum(axis=1)

#repeat non NaNs by columns names
df1 = df1.notna().dot(df1.columns).reset_index(name='Type')
#split columns
df1[['Count','Seq']] = df1['Seq'].str.split('_', expand=True)
df1[['Type','Label','Replicate']] = df1['Type'].str.split('_', expand=True)
#set new columns
df1['Number'] = v.to_numpy()

#change order of columns
df1 = df1[['Type', 'Label', 'Replicate', 'Count', 'Seq', 'Number']]
print (df1) 
    Type Label Replicate Count   Seq  Number
0  Type2     3         B     2  CTGT    34.0
1  Type2     3         B     7  GATG  4573.0
2  Type2     3         B     4  ACTT    16.0
3  Type1     3         A     5  GTCA    17.0
4  Type1     3         A     1  TCGA    25.0