I have this data frame:
df = pd.DataFrame({ "cluster" : ["A", "A", "A", "A", "A", "C", "C", "C", "C"],
"col1": ["0", "0", "1", "1", "0", "1", "0", "1", "0"],
"col2": ["1", "1", "1", "1", "0", "1", "1", "1", "0"] })
and I can't figure out the best way to have a data frame result with this format
col | cluster | avg |
---|---|---|
col1 | A | 0.4 |
col1 | C | 0.5 |
col2 | A | 0.8 |
col2 | C | 0.75 |
I tried using transpose, then melt the cluster, but didn't work. Any help would be highly appreciated!
My desired result is the table - where all columns (except cluster) are rows, and cluster is melt and the average is calculated for the values of the column for each cluster
CodePudding user response:
Looks like a combination of melt
and groupby
mean
would do the trick. Note that the "col" values are strings in your input, so we also have to convert them to integers first before calculating the mean:
out = (df.melt(['cluster'], var_name='col')
.assign(value=lambda x: x['value'].astype(int))
.groupby(['col', 'cluster'], as_index=False).mean())
Output:
col cluster value
0 col1 A 0.40
1 col1 C 0.50
2 col2 A 0.80
3 col2 C 0.75
CodePudding user response:
You can group by cluster
column and calculate mean of each group then melt
df[['col1', 'col2']] = df[['col1', 'col2']].astype(int)
out = (df.groupby('cluster')
.agg('mean').reset_index()
.melt(id_vars=['cluster'], var_name='col', value_name='avg')
)
print(out)
cluster col avg
0 A col1 0.40
1 C col1 0.50
2 A col2 0.80
3 C col2 0.75