Pivot dataframe into one level / rename df.pivot() output columns-CodePudding

I have a dataframe, df_res with volumetric and mean and standard deviation jacobian values against ID strings (subjid), software type ran (pipeline), and anatomical region labels (label_id):

df_res = 

       subjid   pipeline   label_id    volume_(mm^3)    mean_jacobian    stdev_jacobian
0   100007_t0      Rigid          0       10100000.0             1.11             0.078
1   100007_t0      Rigid          1         315439.0          1.04635             0.283
2   100007_t0      Rigid          2         624165.0         0.968231             0.192
3   100007_t0      Rigid          3         515589.0           1.1273             0.229
4   100007_t1      Rigid          0       10084600.0           1.0935             0.033
5   100007_t1      Rigid          1         320533.0           1.0457             0.277
6   100007_t1      Rigid          2         621393.0            0.957             0.193
7   100007_t1      Rigid          3         507840.0          1.00573             0.232

My goal dataframe is something where I would want to have one row per subjid-pipeline pair, and the label_id is almost transposed as column per quantity.

The preview below is a "condensed version" for formatting's sake.

Assume that each label_id from df_res has its own {label_id}_volume_(mm^3), {label_id}_mean_jacobian, and {label_id}_stdev_jacobian columns (total of 9 for each label_id pipeline subjid = 11 total columns):

df_goal = 

subjid      pipeline    label0_volume_(mm^3)    ...   label3_volume_(mm^3)    ...   label3_mean_jacobian
100007_t0      Rigid              10100000.0                      515589.0          1.1273
100007_t1      Rigid              10084600.0                      507840.0         1.00573

I managed to get something close, with df_res_test. I did a df.pivot() function:

>>> df_res_pivot = df_res.pivot(index="subjid", columns="label_id", values=["volume_(mm^3)", "mean_jacobian", "stdev_jacobian"])

df_res_pivot = 

          volume_(mm^3)                      ... stdev_jacobian                    
label_id              0         1         2  ...              1         2         3
subjid                                       ...                                   
100007_t0    10100000.0  315439.0  624165.0  ...       0.289318  0.192214  0.229341
100007_t1    10084600.0  320533.0  621393.0  ...       0.277735  0.193940  0.232486

[2 rows x 12 columns]

Is there a way I can rename / combine the values and columns argument to make my data more sensible and look like df_goal?

CodePudding user response：

@AndrejKesely's solution is great. Another option is with pivot_wider from pyjanitor, using the names_glue parameter:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_wider(
     index = ['subjid', 'pipeline'], 
     names_from = 'label_id', 
     flatten_levels=True, 
     names_glue = "label{label_id}_{_value}")
)
      subjid pipeline  label0_volume_(mm^3)  label1_volume_(mm^3)  ...  label0_stdev_jacobian  label1_stdev_jacobian  label2_stdev_jacobian  label3_stdev_jacobian
0  100007_t0    Rigid            10100000.0              315439.0  ...                  0.078                  0.283                  0.192                  0.229
1  100007_t1    Rigid            10084600.0              320533.0  ...                  0.033                  0.277                  0.193                  0.232

CodePudding user response：

Try:

df = df.pivot(index=["subjid", "pipeline"], columns=["label_id"])
df.columns = [f"label{b}_{a}" for a, b in df.columns]

print(df.reset_index())

Prints:

      subjid pipeline  label0_volume_(mm^3)  label1_volume_(mm^3)  label2_volume_(mm^3)  label3_volume_(mm^3)  label0_mean_jacobian  label1_mean_jacobian  label2_mean_jacobian  label3_mean_jacobian  label0_stdev_jacobian  label1_stdev_jacobian  label2_stdev_jacobian  label3_stdev_jacobian
0  100007_t0    Rigid            10100000.0              315439.0              624165.0              515589.0                1.1100               1.04635              0.968231               1.12730                  0.078                  0.283                  0.192                  0.229
1  100007_t1    Rigid            10084600.0              320533.0              621393.0              507840.0                1.0935               1.04570              0.957000               1.00573                  0.033                  0.277                  0.193                  0.232

EDIT: With the dataframe df_res_pivot you have in your question:

df_res_pivot = df.pivot(
    index="subjid",
    columns="label_id",
    values=["volume_(mm^3)", "mean_jacobian", "stdev_jacobian"],
)

df_res_pivot.columns = [f"label{b}_{a}" for a, b in df_res_pivot.columns]

print(df_res_pivot.reset_index())

Prints:

      subjid  label0_volume_(mm^3)  label1_volume_(mm^3)  label2_volume_(mm^3)  label3_volume_(mm^3)  label0_mean_jacobian  label1_mean_jacobian  label2_mean_jacobian  label3_mean_jacobian  label0_stdev_jacobian  label1_stdev_jacobian  label2_stdev_jacobian  label3_stdev_jacobian
0  100007_t0            10100000.0              315439.0              624165.0              515589.0                1.1100               1.04635              0.968231               1.12730                  0.078                  0.283                  0.192                  0.229
1  100007_t1            10084600.0              320533.0              621393.0              507840.0                1.0935               1.04570              0.957000               1.00573                  0.033                  0.277                  0.193                  0.232