Home > Enterprise >  Aggregating and flattening a dataframe using multiple columns
Aggregating and flattening a dataframe using multiple columns

Time:09-15

Long time listener; First time caller :)

I have the following dataframe:

df = pd.DataFrame({'a':['a1','a1','a1','a2','a2','a3','a3','a3','a3'],
                   'cat1':['C','D','D','C','D','D','D','D','C'],
                   'cat2':['X','Y','X','Y','X','Y','X','Y','X'],
                   'is':[1,2,3,1,2,1,2,3,4],
                   'fs':[.1,.2,.3,.1,.2,.1,.2,.3,.4]})

giving:

    a cat1 cat2  is   fs
0  a1    C    X   1  0.1
1  a1    D    Y   2  0.2
2  a1    D    X   3  0.3
3  a2    C    Y   1  0.1
4  a2    D    X   2  0.2
5  a3    D    Y   1  0.1
6  a3    D    X   2  0.2
7  a3    D    Y   3  0.3
8  a3    C    X   4  0.4

I am looking to group the dataframe in such a way that it gives the below output, where each row is first grouped by the value in column a, then the numeric columns of the dataframe are aggregated (sum, mean, etc) based on category -- cat1 and cat2.

    a sum(is|cat1=C) sum(fs|cat1=C) sum(is|cat1=D) sum(fs|cat1=D) sum(is|cat2=X)  sum(fs|cat2=X) sum(is|cat2=Y)  sum(fs|cat2=Y)
0  a1              1            0.1             5             0.5              4             0.4              2             0.2
1  a2              1            0.1             2             0.2              2             0.2              1             0.1
2  a3              4            0.4             6             0.6              6             0.6              4             0.4

For example, in row 0 above, sum(is|cat1=D) is the sum of is values for a1 when cat1 = D. Hope this is clear enough. Happy to further clarify.

Here's the best I'm able to do so far:

output = []
aggregations = ['sum']
for category in ['cat1', 'cat2']:
    num_data = df.select_dtypes('number')
    grouped = df.groupby(['a', category]).agg({'is':'sum', 'fs':'sum'})
    flattened = pd.DataFrame(grouped.to_records())
    flattened.columns = ['a', category]   ['%s(%s|%s)' % (op, column, category)
                                                         for column in list(num_data.columns) for op in aggregations]
    output.append(flattened)

output[0] looks something like this:

    a cat1  sum(is|cat1)  sum(fs|cat1)
0  a1    C             1           0.1
1  a1    D             5           0.5
2  a2    C             1           0.1
3  a2    D             2           0.2
4  a3    C             4           0.4
5  a3    D             6           0.6

which is obviously some ways off. I can't figure out how to further flatten it and rename the sum headers appropriately. Also, I don't know how to include cat2-based aggregation in the results.

Thanks!

CodePudding user response:

Use DataFrame.melt before aggregating, then add variable,value to groupby, then reshape by DataFrame.unstack, sorting third level of MultiIndex in columns and flattening it:

categories = ['cat1', 'cat2']

df1 = (df.melt(id_vars=df.columns.difference(categories))
        .groupby(['a','variable','value'])
        .agg({'is':'sum', 'fs':'sum'})
        .unstack([1,2])
        .sort_index(axis=1, level=2, sort_remaining=False))

df1.columns = [f'sum({a}|{b}={c})' for a,b,c in df1.columns]
df1 = df1.reset_index()
print (df1)
    a  sum(is|cat1=C)  sum(fs|cat1=C)  sum(is|cat1=D)  sum(fs|cat1=D)  \
0  a1               1             0.1               5             0.5   
1  a2               1             0.1               2             0.2   
2  a3               4             0.4               6             0.6   

   sum(is|cat2=X)  sum(fs|cat2=X)  sum(is|cat2=Y)  sum(fs|cat2=Y)  
0               4             0.4               2             0.2  
1               2             0.2               1             0.1  
2               6             0.6               4             0.4  

CodePudding user response:

Slightly different approach:

Group the dataframe on a, and category column, calculate the sum as agregate, add suffix, and unstack the last level i.e. -1, then finally rename the columns. Do the same for both the columns, finally join the dataframes.

cat1=df.groupby(['a', 'cat1']).agg(sum).add_suffix('|cat1').unstack(-1)
cat1.columns=[f"sum({x}={y})" for x,y in cat1.columns]

cat2=df.groupby(['a', 'cat2']).agg(sum).add_suffix('|cat2').unstack(-1)
cat2.columns=[f"sum({x}={y})" for x,y in cat2.columns]
pd.concat([cat1, cat2], axis=1)

OUTPUT:

    sum(is|cat1=C)  sum(is|cat1=D)  sum(fs|cat1=C)  sum(fs|cat1=D)  \
a                                                                    
a1               1               5             0.1             0.5   
a2               1               2             0.1             0.2   
a3               4               6             0.4             0.6   
    sum(is|cat2=X)  sum(is|cat2=Y)  sum(fs|cat2=X)  sum(fs|cat2=Y)  
a                                                                   
a1               4               2             0.4             0.2  
a2               2               1             0.2             0.1  
a3               6               4             0.6             0.4  
  • Related