Long time listener; First time caller :)
I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a3', 'a3', 'a3'],
                   'cat1': ['C', 'D', 'D', 'C', 'D', 'D', 'D', 'D', 'C'],
                   'cat2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X'],
                   'is': [1, 2, 3, 1, 2, 1, 2, 3, 4],
                   'fs': [.1, .2, .3, .1, .2, .1, .2, .3, .4]})
giving:
    a  cat1  cat2  is   fs
0  a1     C     X   1  0.1
1  a1     D     Y   2  0.2
2  a1     D     X   3  0.3
3  a2     C     Y   1  0.1
4  a2     D     X   2  0.2
5  a3     D     Y   1  0.1
6  a3     D     X   2  0.2
7  a3     D     Y   3  0.3
8  a3     C     X   4  0.4
I am looking to group the dataframe in such a way that it gives the output below, where the rows are first grouped by the value in column a, and then the numeric columns are aggregated (sum, mean, etc.) per value of the category columns cat1 and cat2.
    a  sum(is|cat1=C)  sum(fs|cat1=C)  sum(is|cat1=D)  sum(fs|cat1=D)  sum(is|cat2=X)  sum(fs|cat2=X)  sum(is|cat2=Y)  sum(fs|cat2=Y)
0  a1               1             0.1               5             0.5               4             0.4               2             0.2
1  a2               1             0.1               2             0.2               2             0.2               1             0.1
2  a3               4             0.4               6             0.6               6             0.6               4             0.4
For example, in row 0 above, sum(is|cat1=D) is the sum of the is values for a1 where cat1 = D. Hope this is clear enough. Happy to further clarify.
Here's the best I'm able to do so far:
output = []
aggregations = ['sum']
for category in ['cat1', 'cat2']:
    num_data = df.select_dtypes('number')
    grouped = df.groupby(['a', category]).agg({'is': 'sum', 'fs': 'sum'})
    flattened = pd.DataFrame(grouped.to_records())
    # keep 'a' and the category column, then append the renamed aggregate columns
    flattened.columns = ['a', category] + ['%s(%s|%s)' % (op, column, category)
                                           for column in list(num_data.columns)
                                           for op in aggregations]
    output.append(flattened)
output[0] looks something like this:
    a  cat1  sum(is|cat1)  sum(fs|cat1)
0  a1     C             1           0.1
1  a1     D             5           0.5
2  a2     C             1           0.1
3  a2     D             2           0.2
4  a3     C             4           0.4
5  a3     D             6           0.6
which is obviously some ways off. I can't figure out how to further flatten it and rename the sum headers appropriately. Also, I don't know how to include the cat2-based aggregation in the results.
Thanks!
CodePudding user response:
Use DataFrame.melt before aggregating, add the resulting variable and value columns to the groupby, then reshape with DataFrame.unstack, sort the third level of the column MultiIndex, and flatten it:
categories = ['cat1', 'cat2']

# melt the category columns into (variable, value) pairs, aggregate per group,
# then pivot the categories back out into the columns
df1 = (df.melt(id_vars=df.columns.difference(categories))
         .groupby(['a', 'variable', 'value'])
         .agg({'is': 'sum', 'fs': 'sum'})
         .unstack([1, 2])
         .sort_index(axis=1, level=2, sort_remaining=False))

# flatten the 3-level column MultiIndex: (value column, category, category value)
df1.columns = [f'sum({a}|{b}={c})' for a, b, c in df1.columns]
df1 = df1.reset_index()
print(df1)
    a  sum(is|cat1=C)  sum(fs|cat1=C)  sum(is|cat1=D)  sum(fs|cat1=D)  \
0  a1               1             0.1               5             0.5
1  a2               1             0.1               2             0.2
2  a3               4             0.4               6             0.6

   sum(is|cat2=X)  sum(fs|cat2=X)  sum(is|cat2=Y)  sum(fs|cat2=Y)
0               4             0.4               2             0.2
1               2             0.2               1             0.1
2               6             0.6               4             0.4
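
If you also need other aggregations besides sum (the question mentions mean), the same melt/unstack idea should extend by passing a list of functions to agg; the flattening step then just has to pick up the extra aggregation level. A rough sketch along those lines, assuming the df from the question (the op(col|cat=val) naming pattern is my own choice, not part of the answer above):

categories = ['cat1', 'cat2']
aggregations = ['sum', 'mean']

df2 = (df.melt(id_vars=df.columns.difference(categories))
         .groupby(['a', 'variable', 'value'])
         .agg({'is': aggregations, 'fs': aggregations})
         .unstack([1, 2])
         .sort_index(axis=1, level=2, sort_remaining=False))

# columns are now (value column, aggregation, category, category value) tuples
df2.columns = [f'{op}({col}|{cat}={val})' for col, op, cat, val in df2.columns]
df2 = df2.reset_index()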
CodePudding user response:
Slightly different approach:
Group the dataframe on a and the category column, take the sum as the aggregate, add a suffix, unstack the last level (i.e. -1), and finally rename the columns. Do the same for both category columns, then join the two dataframes.
# aggregate per (a, category), tag the value columns, then pivot the category out into the columns
cat1 = df.groupby(['a', 'cat1'])[['is', 'fs']].sum().add_suffix('|cat1').unstack(-1)
cat1.columns = [f"sum({x}={y})" for x, y in cat1.columns]
cat2 = df.groupby(['a', 'cat2'])[['is', 'fs']].sum().add_suffix('|cat2').unstack(-1)
cat2.columns = [f"sum({x}={y})" for x, y in cat2.columns]
pd.concat([cat1, cat2], axis=1)
OUTPUT:
    sum(is|cat1=C)  sum(is|cat1=D)  sum(fs|cat1=C)  sum(fs|cat1=D)  \
a
a1               1               5             0.1             0.5
a2               1               2             0.1             0.2
a3               4               6             0.4             0.6

    sum(is|cat2=X)  sum(is|cat2=Y)  sum(fs|cat2=X)  sum(fs|cat2=Y)
a
a1               4               2             0.4             0.2
a2               2               1             0.2             0.1
a3               6               4             0.6             0.4
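
If you would rather not repeat the two nearly identical blocks, the same approach can be wrapped in a loop over the category columns. A minimal sketch, assuming the df from the question; the reset_index at the end just brings a back as a regular column, matching the requested layout:

parts = []
for cat in ['cat1', 'cat2']:
    # aggregate per (a, category), then pivot the category values out into the columns
    g = df.groupby(['a', cat])[['is', 'fs']].sum().add_suffix(f'|{cat}').unstack(-1)
    g.columns = [f'sum({x}={y})' for x, y in g.columns]
    parts.append(g)

result = pd.concat(parts, axis=1).reset_index()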