Suppose I have a DataFrame that looks something like this:
id | country | grade | category | amount | |
---|---|---|---|---|---|
0 | 7 | fr | a | mango | 52 |
1 | 5 | fr | b | banana | 68 |
2 | 7 | fr | a | banana | 73 |
3 | 4 | it | c | mango | 70 |
4 | 5 | fr | b | banana | 99 |
5 | 9 | uk | a | apple | 29 |
6 | 3 | uk | a | mango | 83 |
7 | 0 | uk | b | banana | 59 |
8 | 2 | it | c | mango | 11 |
9 | 9 | uk | a | banana | 91 |
10 | 0 | uk | b | mango | 95 |
11 | 8 | uk | a | mango | 30 |
12 | 3 | uk | a | mango | 82 |
13 | 1 | it | b | banana | 78 |
14 | 3 | uk | a | apple | 76 |
15 | 6 | it | c | apple | 76 |
16 | 2 | it | c | mango | 10 |
17 | 1 | it | b | mango | 30 |
18 | 9 | uk | a | banana | 17 |
19 | 2 | it | c | mango | 58 |
Where each id
belongs to a grade
and lives in a country
, and spends a certain amount
on various fruits (category
). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (ie total) spent on each category
by each combination of country
and grade
. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4
total: 70
id: 2
total: 11 10 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country
/grade
/category
combinations.
The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
m = x.groupby("id").sum()["amount"].mean()
k = x.groupby("id").count()["amount"].mean()
if math.isnan(m):
m = 0
if math.isnan(k):
k = 0
c.append(m)
c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping,on=["country","grade","category"],how="left")
Which gives the desired output:
id | country | grade | category | amount | mean_totals | mean_counts | |
---|---|---|---|---|---|---|---|
0 | 7 | fr | a | mango | 52 | 52 | 1 |
1 | 5 | fr | b | banana | 68 | 167 | 2 |
2 | 7 | fr | a | banana | 73 | 73 | 1 |
3 | 4 | it | c | mango | 70 | 74.5 | 2 |
4 | 5 | fr | b | banana | 99 | 167 | 2 |
5 | 9 | uk | a | apple | 29 | 52.5 | 1 |
6 | 3 | uk | a | mango | 83 | 97.5 | 1.5 |
7 | 0 | uk | b | banana | 59 | 59 | 1 |
8 | 2 | it | c | mango | 11 | 74.5 | 2 |
9 | 9 | uk | a | banana | 91 | 108 | 2 |
10 | 0 | uk | b | mango | 95 | 95 | 1 |
11 | 8 | uk | a | mango | 30 | 97.5 | 1.5 |
12 | 3 | uk | a | mango | 82 | 97.5 | 1.5 |
13 | 1 | it | b | banana | 78 | 78 | 1 |
14 | 3 | uk | a | apple | 76 | 52.5 | 1 |
15 | 6 | it | c | apple | 76 | 76 | 1 |
16 | 2 | it | c | mango | 10 | 74.5 | 2 |
17 | 1 | it | b | mango | 30 | 30 | 1 |
18 | 9 | uk | a | banana | 17 | 108 | 2 |
19 | 2 | it | c | mango | 58 | 74.5 | 2 |
The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
CodePudding user response:
It looks like you need a double groupby
. Once for the sum, once for the mean:
out = (df
.groupby(['country', 'grade', 'category', 'id']).sum()
.groupby(['country', 'grade', 'category']).mean()
)
output:
amount
country grade category
fr a banana 73.0
mango 52.0
b banana 167.0
it b banana 78.0
mango 30.0
c apple 76.0
mango 74.5
uk a apple 52.5
banana 108.0
mango 97.5
b banana 59.0
mango 95.0
CodePudding user response:
You can create mean_totals
column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives
0 7 fr a mango 52 52.0
1 5 fr b banana 68 167.0
2 7 fr a banana 73 73.0
3 4 it c mango 70 74.5
4 5 fr b banana 99 167.0
5 9 uk a apple 29 52.5
6 3 uk a mango 83 97.5
7 0 uk b banana 59 59.0
8 2 it c mango 11 74.5
9 9 uk a banana 91 108.0
10 0 uk b mango 95 95.0
11 8 uk a mango 30 97.5
12 3 uk a mango 82 97.5
13 1 it b banana 78 78.0
14 3 uk a apple 76 52.5
15 6 it c apple 76 76.0
16 2 it c mango 10 74.5
17 1 it b mango 30 30.0
18 9 uk a banana 17 108.0
19 2 it c mango 58 74.5
CodePudding user response:
I hope this will work fast.. First group and compute the required details and merge with existing df.
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = df.groupby(by=['country','grade','category','id'], as_index=False).agg(int_totals=pd.NamedAgg(column='amount',aggfunc='sum'),int_counts=pd.NamedAgg(column='id',aggfunc='count')).groupby(by=['country','grade','category'], as_index=False).agg(mean_totals=pd.NamedAgg(column='int_totals',aggfunc='mean'),mean_counts=pd.NamedAgg(column='int_counts',aggfunc='mean'))
output_df = pd.merge(df,intermediate_df, left_on = ['country','grade','category'],right_on = ['country','grade','category'], how='left')
print(output_df)