Transform pandas DataFrame based on cell information

Time:11-13

I have a pandas DataFrame df with the following format:

model              auc            p             r
`a-num5-run1`      0.9            0.8           1.0
`a-num5-run2`      0.8            0.7           0.9
`b-num5-run1`      0.7            0.6           0.8
`b-num5-run2`      0.6            0.5           0.7
`a-num10-run1`     0.5            0.4           0.6
`a-num10-run2`     0.4            0.3           0.5
`b-num10-run1`     0.3            0.2           0.4
`b-num10-run2`     0.2            0.1           0.3
.... 
`a-num100-run1`     0.8            0.9           0.7
`a-num100-run2`     0.6            0.7           0.4
`a-num100-run1`     0.4            0.5           0.1
`a-num100-run2`     0.2            0.3           0.8

The model column encodes what distinguishes each row: the model name (a or b), a number (num5, num10, ..., num100), and the run. I would like to build a DataFrame in which the values of each column (auc, p, r) are averaged over the runs and stored in a tuple, with one column per number and one row per model (a or b in this case). The desired result is the matrix shown below:

model_name     5                         10                   ...   100
a              (0.85, 0.75, 0.95)        (0.45, 0.35, 0.55)   ...   (0.7, 0.8, 0.55)
b              (0.65, 0.55, 0.75)        (0.25, 0.15, 0.35)   ...   (0.3, 0.4, 0.45)

How can I do this?

Answer:

First split the model column into a helper DataFrame with Series.str.split. Then call DataFrame.pivot_table, using Series.str.extract to pull out the integers for the columns and relying on the default aggregation (mean). Finally, combine the averaged values into tuples:

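# Helper DataFrame: column 0 is the model name, 1 is 'numN', 2 is 'runN'
df1 = df['model'].str.split('-', expand=True)

# mean is pivot_table's default aggregation; r'(\d+)' captures the integer in 'numN'
df = (df.pivot_table(index=df1[0], 
                    columns=df1[1].str.extract(r'(\d+)', expand=False).astype(int), 
                    values=['auc','p','r'], fill_value=0)
       .round(2)
       .T
       .groupby(level=1)   # group the (metric, number) rows by number
       .agg(tuple)         # collect auc, p, r into one tuple
       .T)
print(df)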
1                 5                   10               100
0                                                         
a  (0.85, 0.75, 0.95)  (0.45, 0.35, 0.55)  (0.5, 0.6, 0.5)
b  (0.65, 0.55, 0.75)  (0.25, 0.15, 0.35)  (0.0, 0.0, 0.0)
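
To make the intermediate steps easier to follow, here is a minimal sketch that builds only the num5 and num10 rows from the question (the elided and num100 rows are left out) and inspects the helper pieces that go into pivot_table. The names names, nums and wide are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample rows copied from the question (num5 and num10 only).
df = pd.DataFrame({
    "model": ["a-num5-run1", "a-num5-run2", "b-num5-run1", "b-num5-run2",
              "a-num10-run1", "a-num10-run2", "b-num10-run1", "b-num10-run2"],
    "auc": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    "p":   [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "r":   [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
})

# Split 'a-num5-run1' into three columns: 0 -> 'a', 1 -> 'num5', 2 -> 'run1'.
parts = df["model"].str.split("-", expand=True)
names = parts[0]

# Pull the digits out of 'num5' and convert them to integers.
nums = parts[1].str.extract(r"(\d+)", expand=False).astype(int)

# Averaging over runs: rows are model names, columns are a
# (metric, number) MultiIndex. This is the table that the
# .T.groupby(level=1).agg(tuple).T chain collapses into tuples.
wide = df.pivot_table(index=names, columns=nums, values=["auc", "p", "r"])
print(wide)
```

Level 1 of the column MultiIndex holds the numbers, which is why the answer groups on level=1 after transposing.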

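Or, grouping over the columns directly (this variant returns a default 0, 1 index instead of the model names, and the axis=1 argument of groupby is deprecated since pandas 2.1):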

import pandas as pd

df1 = df['model'].str.split('-', expand=True)

df = (df.pivot_table(index=df1[0],
                    columns=df1[1].str.extract(r'(\d+)', expand=False).astype(int),
                    values=['auc','p','r'], 
                    fill_value=0)
        .round(2)
        .groupby(level=1, axis=1)   # one group of (auc, p, r) columns per number
        # turn each row of the group into a plain tuple
        .apply(lambda x: pd.Series((x.itertuples(index=False, name=None)), name=x.name))
        )
print(df)
1                 5                   10               100
0  (0.85, 0.75, 0.95)  (0.45, 0.35, 0.55)  (0.5, 0.6, 0.5)
1  (0.65, 0.55, 0.75)  (0.25, 0.15, 0.35)  (0.0, 0.0, 0.0)
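
If you want to keep the model names in the index without grouping over axis=1, one possible alternative (a sketch, not the method from the answer above) is to average with a plain groupby, build the tuples row by row, and then unstack the number level. The names keys, means and result are illustrative, and the regular expression assumes every model string looks like name-numN-runM. Only the num5 and num10 rows from the question are used.

```python
import pandas as pd

# Sample rows copied from the question (num5 and num10 only).
df = pd.DataFrame({
    "model": ["a-num5-run1", "a-num5-run2", "b-num5-run1", "b-num5-run2",
              "a-num10-run1", "a-num10-run2", "b-num10-run1", "b-num10-run2"],
    "auc": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    "p":   [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "r":   [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
})

# Named groups give labelled helper columns 'name' and 'num'.
keys = df["model"].str.extract(r"^(?P<name>[^-]+)-num(?P<num>\d+)-run\d+$")
keys["num"] = keys["num"].astype(int)

# Average auc, p, r over the runs of each (name, num) pair.
means = (df[["auc", "p", "r"]]
         .groupby([keys["name"], keys["num"]])
         .mean()
         .round(2))

# One (auc, p, r) tuple per pair, then move 'num' into the columns.
result = means.apply(tuple, axis=1).unstack("num")
print(result)
```

Here the index of result is the model name and the columns are the integers taken from the model string.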