Transpose or melt to group by columns?-CodePudding

I have this data frame:

df = pd.DataFrame({ "cluster" : ["A", "A", "A", "A", "A", "C", "C", "C", "C"], 
                    "col1": ["0", "0", "1", "1", "0", "1", "0", "1", "0"], 
                    "col2": ["1", "1", "1", "1", "0", "1", "1", "1", "0"] })

and I can't figure out the best way to have a data frame result with this format

col	cluster	avg
col1	A	0.4
col1	C	0.5
col2	A	0.8
col2	C	0.75

I tried using transpose, then melt the cluster, but didn't work. Any help would be highly appreciated!

My desired result is the table - where all columns (except cluster) are rows, and cluster is melt and the average is calculated for the values of the column for each cluster

CodePudding user response：

Looks like a combination of melt and groupby mean would do the trick. Note that the "col" values are strings in your input, so we also have to convert them to integers first before calculating the mean:

out = (df.melt(['cluster'], var_name='col')
       .assign(value=lambda x: x['value'].astype(int))
       .groupby(['col', 'cluster'], as_index=False).mean())

Output:

    col cluster  value
0  col1       A   0.40
1  col1       C   0.50
2  col2       A   0.80
3  col2       C   0.75

CodePudding user response：

You can group by cluster column and calculate mean of each group then melt

df[['col1', 'col2']] = df[['col1', 'col2']].astype(int)

out = (df.groupby('cluster')
         .agg('mean').reset_index()
         .melt(id_vars=['cluster'], var_name='col', value_name='avg')
)

print(out)

  cluster   col   avg
0       A  col1  0.40
1       C  col1  0.50
2       A  col2  0.80
3       C  col2  0.75