I imported a dataset with different column names that all start with Area, Mean, StdDev but have () with different strings inside. What I want to do is split the string in each column by the "( )" and set the first part as the column headers and the second part as a row under that header. I was looking into stacking and unstacking similar to this link or .split() and .explode() in this link but I don't know how to apply it to mine. It is slightly different. My initial dataframe looks like df1. What I am trying to do is df2 but at the end, I am trying to get a format like df3. If you can help me getting to either df2 or df3 that would be great, otherwise I can keep playing around with it after I mange to get it to look like df2.
My dataset looks like:
df1 = pd.DataFrame({
1: {'1': "Area(E10_1_nucleus)", '2': 435},
2: {'1': "Mean(E10_1_nucleus)", '2': 313},
3: {'1': "StdDev(E10_1_nucleus)", '2': 150},
4: {'1': "Area(E10_1_cytoplasm)", '2': 635},
5: {'1': "Mean(E10_1_cytoplasm)", '2': 847},
6: {'1': "StdDev(E10_1_cytoplasm)", '2': 321}})
I am trying to do the following:
df2= pd.DataFrame({
1: {'1': "Area", '2':'E10_1_nucleus','3': 435},
2: {'1': "Mean",'2':'E10_1_nucleus' ,'3': 313},
3: {'1': "StdDev",'2':'E10_1_nucleus' ,'3': 150},
4: {'1': "Area",'2':'E10_1_cytoplasm' ,'3': 635},
5: {'1': "Mean",'2':'E10_1_cytoplasm' ,'3': 847},
6: {'1': "StdDev",'2':'E10_1_cytoplasm' ,'3': 331}})
The final desired format is the following:
df3= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm' },
'Area': {'1': 435,'2':635},
'Mean': {'1': 313,'2':847},
'StdDev': {'1': 150,'2':331}})
CodePudding user response:
Using pivot
df = df.transpose()
df = (
df
.assign(
Aggregate=df["1"].str.replace(r"\(.*\)", "", regex=True),
Label=df.pop("1").str.extract(r"\((.*?)\)", expand=True)
)
.pivot(index="Label", columns="Aggregate", values="2")
.astype(int)
.reset_index()
.rename_axis(None, axis=1)
)
print(df)
Label Area Mean StdDev
0 E10_1_cytoplasm 635 847 321
1 E10_1_nucleus 435 313 150