I have been struggling with grouping by two conditions (or grouping "twice"?), adding a condition on the number of unique values, and taking the mean of those groups.
I created a sample dataset to make this clearer: a chef makes dishes, which are made of ingredients, each of which has a cost. The cost can vary depending on the day, the place of purchase, and so on, but that is not relevant here.
chef_id dish ingr price
0 1 dish_1 ingr1 4
1 1 dish_1 ingr2 3
2 1 dish_1 ingr3 5
3 2 dish_2 ingr1 1
4 2 dish_2 ingr2 3
5 2 dish_4 ingr1 2
6 3 dish_3 ingr1 6
7 3 dish_3 ingr2 4
In my real dataset there is also a date, but I don't think it's relevant. My main objective is to group by chef, and by dish within each chef, and compute the average cost per dish (per chef). But I want this ONLY if the chef has made at least 2 DIFFERENT dishes in total (regardless of the dates, which is why I don't think the date is relevant). If a chef has made only one dish (even if several times), that chef should be excluded. The ingredients only matter for the price, which is why I am not using them as a condition.
So the desired output would be something like:
chef_id,dish,price('mean')
2,dish_2,2
2,dish_4,2
I've looked at counting values per group, deleting rows based on a condition (I also found a "not in" approach, taking the list of chefs with unique dishes), and grouping by different conditions.
This only gives the mean price of every dish for every chef, without any filtering, which is not my objective:
df_chef.groupby(['chef_id', 'dish'], as_index=False)['price'].mean()
This is my attempt so far, which does not achieve it:
df_uniques = df.groupby('chef_id')['dish'].unique()
new_unique = df_uniques.to_frame().reset_index()
to_del = new_unique.loc[((new_unique['dish'].apply(len)) == 1)] #if it has only 1 dish
users_list = to_del['chef_id'].tolist() #list of chefs to delete
modi_df = df[~df['chef_id'].isin(users_list)]
modi_df.groupby(['chef_id','dish'])['price'].mean()
modi_df
According to the docs, once I have filtered out the chefs I don't want, I should be able to group by 2 cols, and then do a mean of one other col:
df.groupby(['group_col1', 'group_col2'])['value_col'].mean()
CodePudding user response:
First, get the rows with chefs that make at least two different dishes.
at_least_two_different_dishes = df.groupby('chef_id').dish.transform('nunique') > 1
Then use this mask together with .groupby() to get the desired result:
df[at_least_two_different_dishes].groupby(['chef_id', 'dish'], as_index=False).price.mean()
Output:
chef_id dish price
0 2 dish_2 2.0
1 2 dish_4 2.0
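For anyone who wants to run this end to end, here is a minimal self-contained sketch of the two steps above. The DataFrame is rebuilt by hand from the sample in the question, and the names dishes_per_chef, mask and result are only illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "chef_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "dish": ["dish_1", "dish_1", "dish_1", "dish_2",
             "dish_2", "dish_4", "dish_3", "dish_3"],
    "ingr": ["ingr1", "ingr2", "ingr3", "ingr1",
             "ingr2", "ingr1", "ingr1", "ingr2"],
    "price": [4, 3, 5, 1, 3, 2, 6, 4],
})

# transform('nunique') returns one value per ORIGINAL row (the number of
# distinct dishes of that row's chef), so it stays aligned with df's index.
dishes_per_chef = df.groupby("chef_id")["dish"].transform("nunique")

# Boolean mask: True for rows whose chef has made more than one distinct dish.
mask = dishes_per_chef > 1

# Keep only those rows, then average the price per (chef, dish).
result = df[mask].groupby(["chef_id", "dish"], as_index=False)["price"].mean()
print(result)
```

Because transform keeps the original row index, the mask can be used directly for boolean indexing with no merge or isin step.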
CodePudding user response:
The following drops the data regarding chefs who have only prepared one dish:
dish_ct = df_chef.groupby(['chef_id'])['dish'].nunique()
chefs = dish_ct[dish_ct > 1].index
df_chef = df_chef[df_chef['chef_id'].isin(chefs)]
After that, you can use the same approach as before. The line
print(df_chef.groupby(['chef_id', 'dish'], as_index=False)['price'].mean())
results in
chef_id dish price
0 2 dish_2 2.0
1 2 dish_4 2.0
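As a possible alternative to the nunique + isin steps (this is not what the answer above does), the same filtering can probably be written with GroupBy.filter, which keeps every row of the groups for which the function returns True. The sketch below rebuilds the sample data from the question; kept and means are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df_chef = pd.DataFrame({
    "chef_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "dish": ["dish_1", "dish_1", "dish_1", "dish_2",
             "dish_2", "dish_4", "dish_3", "dish_3"],
    "ingr": ["ingr1", "ingr2", "ingr3", "ingr1",
             "ingr2", "ingr1", "ingr1", "ingr2"],
    "price": [4, 3, 5, 1, 3, 2, 6, 4],
})

# Keep all rows of chefs who have made more than one distinct dish.
kept = df_chef.groupby("chef_id").filter(lambda g: g["dish"].nunique() > 1)

# Average the price per (chef, dish) on the remaining rows.
means = kept.groupby(["chef_id", "dish"], as_index=False)["price"].mean()
print(means)
```

Since the lambda is called once per chef, this is likely slower than the isin or transform versions when there are many groups, so it is mainly a readability trade-off.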