I have a pandas DataFrame df with the following format:
model auc p r
`a-num5-run1` 0.9 0.8 1.0
`a-num5-run2` 0.8 0.7 0.9
`b-num5-run1` 0.7 0.6 0.8
`b-num5-run2` 0.6 0.5 0.7
`a-num10-run1` 0.5 0.4 0.6
`a-num10-run2` 0.4 0.3 0.5
`b-num10-run1` 0.3 0.2 0.4
`b-num10-run2` 0.2 0.1 0.3
....
`a-num100-run1` 0.8 0.9 0.7
`a-num100-run2` 0.6 0.7 0.4
`a-num100-run1` 0.4 0.5 0.1
`a-num100-run2` 0.2 0.3 0.8
The model column encodes the dimensions that distinguish each entry: the model name, a number, and the run.
Now I would like to create a DataFrame in which the values of each column are averaged over the runs and stored in a tuple, with one column per number and one row per model (a or b in this case). The desired result is the matrix shown below:
model_name 5 10 ... 100
a (0.85, 0.75, 0.95) (0.45, 0.35, 0.55) ... (0.7, 0.8, 0.55)
b (0.65, 0.55, 0.75) (0.25, 0.15, 0.35) ... (0.3, 0.4, 0.45)
How can I do this?
CodePudding user response:
First split the model column into a helper DataFrame with Series.str.split, then call DataFrame.pivot_table, using the integers extracted with Series.str.extract as columns and relying on the default mean aggregation, and finally build the tuples:
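# helper DataFrame: column 0 -> model name, 1 -> "numN", 2 -> run
df1 = df['model'].str.split('-', expand=True)
df = (df.pivot_table(index=df1[0],
                     # keep only the digits of "numN" and use them as integer column labels
                     columns=df1[1].str.extract(r'(\d+)', expand=False).astype(int),
                     values=['auc','p','r'], fill_value=0)
        .round(2)
        .T
        .groupby(level=1)   # group the (metric, number) rows by number
        .agg(tuple)         # one (auc, p, r) tuple per model and number
        .T)
print(df)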
1                   5                  10              100
0
a  (0.85, 0.75, 0.95)  (0.45, 0.35, 0.55)  (0.5, 0.6, 0.5)
b  (0.65, 0.55, 0.75)  (0.25, 0.15, 0.35)  (0.0, 0.0, 0.0)
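To make the helper steps easier to follow, here is a minimal sketch that runs them one at a time. It assumes a small sample that mirrors only the num5 and num10 rows of the question; the names parts, nums and means are illustrative and not part of the original answer.

```python
import pandas as pd

# Assumed sample: only the num5 and num10 rows from the question.
df = pd.DataFrame({
    "model": ["a-num5-run1", "a-num5-run2", "b-num5-run1", "b-num5-run2",
              "a-num10-run1", "a-num10-run2", "b-num10-run1", "b-num10-run2"],
    "auc": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    "p":   [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "r":   [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
})

# Step 1: split "a-num5-run1" into three columns (0: "a", 1: "num5", 2: "run1").
parts = df["model"].str.split("-", expand=True)

# Step 2: keep only the digits of "num5" and convert them to integers.
nums = parts[1].str.extract(r"(\d+)", expand=False).astype(int)

# Step 3: pivot_table averages the runs, since its aggfunc defaults to the mean.
means = df.pivot_table(index=parts[0], columns=nums, values=["auc", "p", "r"])

print(parts.head(2))
print(nums.unique())
print(means.round(2))
```

The columns of means form a two-level index (metric, number), which is why the answer groups on level=1 afterwards to collect the three metrics of each number into one tuple.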
Alternatively, group the columns directly and build the tuples with itertuples:
df1 = df['model'].str.split('-', expand=True)
df = (df.pivot_table(index=df1[0],
                     columns=df1[1].str.extract(r'(\d+)', expand=False).astype(int),
                     values=['auc','p','r'],
                     fill_value=0)
        .round(2)
        # note: groupby(..., axis=1) is deprecated since pandas 2.1
        .groupby(level=1, axis=1)
        # the new Series gets a default 0..n-1 index, so the model names are not kept
        .apply(lambda x: pd.Series(list(x.itertuples(index=False, name=None)), name=x.name))
     )
print(df)
1                   5                  10              100
0  (0.85, 0.75, 0.95)  (0.45, 0.35, 0.55)  (0.5, 0.6, 0.5)
1  (0.65, 0.55, 0.75)  (0.25, 0.15, 0.35)  (0.0, 0.0, 0.0)
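As a possible variation (not the answerer's code), the same table can be sketched with a plain row-wise groupby followed by unstack, which avoids grouping over columns and keeps the model names in the index. The sample data and the names name, num, means, tuples and result are assumptions made for illustration.

```python
import pandas as pd

# Assumed sample: only the num5 and num10 rows from the question.
df = pd.DataFrame({
    "model": ["a-num5-run1", "a-num5-run2", "b-num5-run1", "b-num5-run2",
              "a-num10-run1", "a-num10-run2", "b-num10-run1", "b-num10-run2"],
    "auc": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    "p":   [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "r":   [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
})

parts = df["model"].str.split("-", expand=True)
name = parts[0].rename("model_name")
num = parts[1].str.extract(r"(\d+)", expand=False).astype(int).rename("num")

# Average every metric over the runs of each (model_name, num) pair.
means = df.groupby([name, num])[["auc", "p", "r"]].mean().round(2)

# Collapse the three metric columns into one tuple per row,
# then move the "num" index level into the columns.
tuples = pd.Series(list(means.itertuples(index=False, name=None)), index=means.index)
result = tuples.unstack("num")
print(result)
```

Unlike pivot_table with fill_value=0, this sketch leaves NaN where a model has no rows for a given number.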