Renaming columns using the first part of the string in the first row of a dataframe-CodePudding

I imported a dataset with different column names that all start with Area, Mean, StdDev but have () with different strings inside. What I want to do is split the string in each column by the "( )" and set the first part as the column headers and the second part as a row under that header. I was looking into stacking and unstacking similar to this link or .split() and .explode() in this link but I don't know how to apply it to mine. It is slightly different. My initial dataframe looks like df1. What I am trying to do is df2 but at the end, I am trying to get a format like df3. If you can help me getting to either df2 or df3 that would be great, otherwise I can keep playing around with it after I mange to get it to look like df2.

My dataset looks like:

df1 = pd.DataFrame({
1: {'1': "Area(E10_1_nucleus)", '2': 435}, 
2: {'1': "Mean(E10_1_nucleus)", '2': 313},
3: {'1': "StdDev(E10_1_nucleus)", '2': 150}, 
4: {'1': "Area(E10_1_cytoplasm)", '2': 635},
5: {'1': "Mean(E10_1_cytoplasm)", '2': 847}, 
6: {'1': "StdDev(E10_1_cytoplasm)", '2': 321}})

I am trying to do the following:

df2= pd.DataFrame({
1: {'1': "Area", '2':'E10_1_nucleus','3': 435}, 
2: {'1': "Mean",'2':'E10_1_nucleus' ,'3': 313},
3: {'1': "StdDev",'2':'E10_1_nucleus' ,'3': 150}, 
4: {'1': "Area",'2':'E10_1_cytoplasm' ,'3': 635},
5: {'1': "Mean",'2':'E10_1_cytoplasm' ,'3': 847}, 
6: {'1': "StdDev",'2':'E10_1_cytoplasm' ,'3': 331}})

The final desired format is the following:

df3= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm' },
'Area': {'1': 435,'2':635}, 
'Mean': {'1': 313,'2':847},
'StdDev': {'1': 150,'2':331}})

CodePudding user response：

Using pivot

df = df.transpose()
df = (
    df
    .assign(
        Aggregate=df["1"].str.replace(r"\(.*\)", "", regex=True),
        Label=df.pop("1").str.extract(r"\((.*?)\)", expand=True)
    )
    .pivot(index="Label", columns="Aggregate", values="2")
    .astype(int)
    .reset_index()
    .rename_axis(None, axis=1)
)

print(df)

             Label Area Mean StdDev
0  E10_1_cytoplasm  635  847    321
1    E10_1_nucleus  435  313    150