Home > Enterprise >  Efficient mean and total aggregation over multiple Pandas DataFrame columns
Efficient mean and total aggregation over multiple Pandas DataFrame columns

Time:08-21

Suppose I have a DataFrame that looks something like this:

id country grade category amount
0 7 fr a mango 52
1 5 fr b banana 68
2 7 fr a banana 73
3 4 it c mango 70
4 5 fr b banana 99
5 9 uk a apple 29
6 3 uk a mango 83
7 0 uk b banana 59
8 2 it c mango 11
9 9 uk a banana 91
10 0 uk b mango 95
11 8 uk a mango 30
12 3 uk a mango 82
13 1 it b banana 78
14 3 uk a apple 76
15 6 it c apple 76
16 2 it c mango 10
17 1 it b mango 30
18 9 uk a banana 17
19 2 it c mango 58

Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)

import pandas as pd
df = pd.DataFrame({
    "id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

I would like to add two columns to this DF.

First, I'd like a column giving the mean annual (ie total) spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:

id: 4 total: 70

id: 2 total: 11 10 58 = 79

So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.

The second column I want to add is the same but for the mean annual count for each combination.

Desired output and the best I could come up with:

I've managed to populate these two desired columns using the following code:

import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0        
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])

df = df.merge(temp_grouping,on=["country","grade","category"],how="left")

Which gives the desired output:

id country grade category amount mean_totals mean_counts
0 7 fr a mango 52 52 1
1 5 fr b banana 68 167 2
2 7 fr a banana 73 73 1
3 4 it c mango 70 74.5 2
4 5 fr b banana 99 167 2
5 9 uk a apple 29 52.5 1
6 3 uk a mango 83 97.5 1.5
7 0 uk b banana 59 59 1
8 2 it c mango 11 74.5 2
9 9 uk a banana 91 108 2
10 0 uk b mango 95 95 1
11 8 uk a mango 30 97.5 1.5
12 3 uk a mango 82 97.5 1.5
13 1 it b banana 78 78 1
14 3 uk a apple 76 52.5 1
15 6 it c apple 76 76 1
16 2 it c mango 10 74.5 2
17 1 it b mango 30 30 1
18 9 uk a banana 17 108 2
19 2 it c mango 58 74.5 2

The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.

CodePudding user response:

It looks like you need a double groupby. Once for the sum, once for the mean:

out = (df
 .groupby(['country', 'grade', 'category', 'id']).sum()
 .groupby(['country', 'grade', 'category']).mean()
)

output:

                        amount
country grade category        
fr      a     banana      73.0
              mango       52.0
        b     banana     167.0
it      b     banana      78.0
              mango       30.0
        c     apple       76.0
              mango       74.5
uk      a     apple       52.5
              banana     108.0
              mango       97.5
        b     banana      59.0
              mango       95.0

CodePudding user response:

You can create mean_totals column as follows:

mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)

which gives

0   7   fr  a   mango   52  52.0
1   5   fr  b   banana  68  167.0
2   7   fr  a   banana  73  73.0
3   4   it  c   mango   70  74.5
4   5   fr  b   banana  99  167.0
5   9   uk  a   apple   29  52.5
6   3   uk  a   mango   83  97.5
7   0   uk  b   banana  59  59.0
8   2   it  c   mango   11  74.5
9   9   uk  a   banana  91  108.0
10  0   uk  b   mango   95  95.0
11  8   uk  a   mango   30  97.5
12  3   uk  a   mango   82  97.5
13  1   it  b   banana  78  78.0
14  3   uk  a   apple   76  52.5
15  6   it  c   apple   76  76.0
16  2   it  c   mango   10  74.5
17  1   it  b   mango   30  30.0
18  9   uk  a   banana  17  108.0
19  2   it  c   mango   58  74.5

CodePudding user response:

I hope this will work fast.. First group and compute the required details and merge with existing df.

import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = df.groupby(by=['country','grade','category','id'], as_index=False).agg(int_totals=pd.NamedAgg(column='amount',aggfunc='sum'),int_counts=pd.NamedAgg(column='id',aggfunc='count')).groupby(by=['country','grade','category'], as_index=False).agg(mean_totals=pd.NamedAgg(column='int_totals',aggfunc='mean'),mean_counts=pd.NamedAgg(column='int_counts',aggfunc='mean'))
output_df = pd.merge(df,intermediate_df, left_on = ['country','grade','category'],right_on = ['country','grade','category'], how='left')
print(output_df)

Output_dataframe

  • Related