Efficient mean and total aggregation over multiple Pandas DataFrame columns-CodePudding

Suppose I have a DataFrame that looks something like this:

	id	country	grade	category	amount
0	7	fr	a	mango	52
1	5	fr	b	banana	68
2	7	fr	a	banana	73
3	4	it	c	mango	70
4	5	fr	b	banana	99
5	9	uk	a	apple	29
6	3	uk	a	mango	83
7	0	uk	b	banana	59
8	2	it	c	mango	11
9	9	uk	a	banana	91
10	0	uk	b	mango	95
11	8	uk	a	mango	30
12	3	uk	a	mango	82
13	1	it	b	banana	78
14	3	uk	a	apple	76
15	6	it	c	apple	76
16	2	it	c	mango	10
17	1	it	b	mango	30
18	9	uk	a	banana	17
19	2	it	c	mango	58

Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)

import pandas as pd
df = pd.DataFrame({
    "id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

I would like to add two columns to this DF.

First, I'd like a column giving the mean annual (ie total) spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:

id: 4 total: 70

id: 2 total: 11 10 58 = 79

So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.

The second column I want to add is the same but for the mean annual count for each combination.

Desired output and the best I could come up with:

I've managed to populate these two desired columns using the following code:

import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0        
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])

df = df.merge(temp_grouping,on=["country","grade","category"],how="left")

Which gives the desired output:

	id	country	grade	category	amount	mean_totals	mean_counts
0	7	fr	a	mango	52	52	1
1	5	fr	b	banana	68	167	2
2	7	fr	a	banana	73	73	1
3	4	it	c	mango	70	74.5	2
4	5	fr	b	banana	99	167	2
5	9	uk	a	apple	29	52.5	1
6	3	uk	a	mango	83	97.5	1.5
7	0	uk	b	banana	59	59	1
8	2	it	c	mango	11	74.5	2
9	9	uk	a	banana	91	108	2
10	0	uk	b	mango	95	95	1
11	8	uk	a	mango	30	97.5	1.5
12	3	uk	a	mango	82	97.5	1.5
13	1	it	b	banana	78	78	1
14	3	uk	a	apple	76	52.5	1
15	6	it	c	apple	76	76	1
16	2	it	c	mango	10	74.5	2
17	1	it	b	mango	30	30	1
18	9	uk	a	banana	17	108	2
19	2	it	c	mango	58	74.5	2

The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.

CodePudding user response：

It looks like you need a double groupby. Once for the sum, once for the mean:

out = (df
 .groupby(['country', 'grade', 'category', 'id']).sum()
 .groupby(['country', 'grade', 'category']).mean()
)

output:

                        amount
country grade category        
fr      a     banana      73.0
              mango       52.0
        b     banana     167.0
it      b     banana      78.0
              mango       30.0
        c     apple       76.0
              mango       74.5
uk      a     apple       52.5
              banana     108.0
              mango       97.5
        b     banana      59.0
              mango       95.0

CodePudding user response：

You can create mean_totals column as follows:

mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)

which gives

0   7   fr  a   mango   52  52.0
1   5   fr  b   banana  68  167.0
2   7   fr  a   banana  73  73.0
3   4   it  c   mango   70  74.5
4   5   fr  b   banana  99  167.0
5   9   uk  a   apple   29  52.5
6   3   uk  a   mango   83  97.5
7   0   uk  b   banana  59  59.0
8   2   it  c   mango   11  74.5
9   9   uk  a   banana  91  108.0
10  0   uk  b   mango   95  95.0
11  8   uk  a   mango   30  97.5
12  3   uk  a   mango   82  97.5
13  1   it  b   banana  78  78.0
14  3   uk  a   apple   76  52.5
15  6   it  c   apple   76  76.0
16  2   it  c   mango   10  74.5
17  1   it  b   mango   30  30.0
18  9   uk  a   banana  17  108.0
19  2   it  c   mango   58  74.5

CodePudding user response：

I hope this will work fast.. First group and compute the required details and merge with existing df.

import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = df.groupby(by=['country','grade','category','id'], as_index=False).agg(int_totals=pd.NamedAgg(column='amount',aggfunc='sum'),int_counts=pd.NamedAgg(column='id',aggfunc='count')).groupby(by=['country','grade','category'], as_index=False).agg(mean_totals=pd.NamedAgg(column='int_totals',aggfunc='mean'),mean_counts=pd.NamedAgg(column='int_counts',aggfunc='mean'))
output_df = pd.merge(df,intermediate_df, left_on = ['country','grade','category'],right_on = ['country','grade','category'], how='left')
print(output_df)

Output_dataframe