Mean value of 2 group by's if value is not unique (pandas)

I have been struggling with grouping by two conditions (or grouping "twice"?), adding a condition on the number of unique values, and then taking the mean of those groups.

I created a sample dataset to make this easier to understand: a chef makes dishes, which are made of ingredients, each of which has a price. The price can vary depending on the day or the place where the ingredient was bought, but that is not relevant here.

  chef_id    dish   ingr price
0       1  dish_1  ingr1     4
1       1  dish_1  ingr2     3
2       1  dish_1  ingr3     5
3       2  dish_2  ingr1     1
4       2  dish_2  ingr2     3
5       2  dish_4  ingr1     2
6       3  dish_3  ingr1     6
7       3  dish_3  ingr2     4
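
For anyone who wants to reproduce this, the sample can be rebuilt roughly as follows. This is a minimal sketch: the values are copied from the table above, and the variable name df is an assumption (the snippets in this thread call the frame either df or df_chef).

```python
import pandas as pd

# Sample data copied from the table above; the name `df` is an assumption.
df = pd.DataFrame({
    "chef_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "dish": ["dish_1", "dish_1", "dish_1", "dish_2",
             "dish_2", "dish_4", "dish_3", "dish_3"],
    "ingr": ["ingr1", "ingr2", "ingr3", "ingr1",
             "ingr2", "ingr1", "ingr1", "ingr2"],
    "price": [4, 3, 5, 1, 3, 2, 6, 4],
})

# Number of distinct dishes per chef: only chef 2 has more than one.
distinct_dishes = df.groupby("chef_id")["dish"].nunique()
print(distinct_dishes)
```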

In my real dataset there is also a date, but I don't think it's relevant. My main objective is to group by chef, and by dish per chef, and compute the average price per dish (per chef). But all this ONLY if the chef has made at least 2 DIFFERENT dishes in total (regardless of the dates, which is why I don't think they are relevant). If a chef has made only one dish (even if several times), that chef should be left out. The ingredients matter only for the price, which is why I am not using them as a condition.

So the desired output would be something like:

chef_id,dish,price('mean')
2,dish_2,2
2,dish_4,2

I've looked at counting values per group, deleting rows based on a condition (I also found a "not in" approach, building the list of chefs with only one unique dish), and grouping by different conditions.

This gives the mean for every dish of every chef, which is not my objective:

df_chef.groupby(['chef_id', 'dish'], as_index=False).mean('price')

This is my attempt so far, which does not accomplish it:

df_uniques = df.groupby('chef_id')['dish'].unique()
new_unique = df_uniques.to_frame().reset_index()

to_del = new_unique.loc[((new_unique['dish'].apply(len)) == 1)] #if it has only 1 dish
users_list = to_del['chef_id'].tolist() #list of chefs to delete
modi_df = df[~df['chef_id'].isin(users_list)]
modi_df.groupby(['chef_id','dish'])['price'].mean()
modi_df

According to the docs, once I have filtered out the chefs I don't want, I should be able to group by 2 columns and then take the mean of another column:

df.groupby(['group_col1', 'group_col2'])['value_col'].mean()

CodePudding user response：

First, build a boolean mask that marks the rows of chefs who make at least two different dishes.

at_least_two_different_dishes = df.groupby('chef_id').dish.transform('nunique') > 1

Then use this and .groupby() to get the desired result:

df[at_least_two_different_dishes].groupby(['chef_id', 'dish'], as_index=False).price.mean()

Output:

   chef_id    dish  price
0        2  dish_2    2.0
1        2  dish_4    2.0
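
To see what the mask looks like, here is a self-contained sketch of the two steps above run on the sample data from the question. The intermediate names dishes_per_chef, mask and result are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "chef_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "dish": ["dish_1", "dish_1", "dish_1", "dish_2",
             "dish_2", "dish_4", "dish_3", "dish_3"],
    "ingr": ["ingr1", "ingr2", "ingr3", "ingr1",
             "ingr2", "ingr1", "ingr1", "ingr2"],
    "price": [4, 3, 5, 1, 3, 2, 6, 4],
})

# transform('nunique') returns one value per original row:
# the number of distinct dishes made by that row's chef.
dishes_per_chef = df.groupby("chef_id")["dish"].transform("nunique")

# Boolean mask aligned with df: True only for chefs with 2+ different dishes.
mask = dishes_per_chef > 1

# Keep the matching rows, then average the price per (chef, dish).
result = (
    df[mask]
    .groupby(["chef_id", "dish"], as_index=False)["price"]
    .mean()
)
print(result)
```

Because transform keeps the original index, the mask can be used directly for boolean indexing without building a separate list of chefs.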

CodePudding user response：

The following drops the data regarding chefs who have only prepared one dish:

dish_ct = df_chef.groupby(['chef_id'])['dish'].nunique()
chefs = dish_ct[dish_ct > 1].index
df_chef = df_chef[df_chef['chef_id'].isin(chefs)]

After that, you can use the same approach as before. The line

print(df_chef.groupby(['chef_id', 'dish'], as_index=False).mean('price'))

results in

   chef_id    dish  price
0        2  dish_2    2.0
1        2  dish_4    2.0
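
For completeness, here is a self-contained sketch of the same steps on the sample data. One deliberate difference: it selects the price column explicitly with ['price'].mean() instead of calling .mean('price'). As far as I can tell, the first positional parameter of the grouped mean() is numeric_only rather than a column name, so selecting the column seems the safer spelling. The names filtered and result are only illustrative.

```python
import pandas as pd

# Sample data from the question
df_chef = pd.DataFrame({
    "chef_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "dish": ["dish_1", "dish_1", "dish_1", "dish_2",
             "dish_2", "dish_4", "dish_3", "dish_3"],
    "ingr": ["ingr1", "ingr2", "ingr3", "ingr1",
             "ingr2", "ingr1", "ingr1", "ingr2"],
    "price": [4, 3, 5, 1, 3, 2, 6, 4],
})

# Count distinct dishes per chef and keep the chefs with more than one.
dish_ct = df_chef.groupby("chef_id")["dish"].nunique()
chefs = dish_ct[dish_ct > 1].index

# Keep only the rows belonging to those chefs.
filtered = df_chef[df_chef["chef_id"].isin(chefs)]

# Average the price per (chef, dish), selecting the column explicitly.
result = filtered.groupby(["chef_id", "dish"], as_index=False)["price"].mean()
print(result)
```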