I'm wondering if its possible to create your own groupby function that runs a different method for the output in a single column depending on if the index is in some list or not. For example:
df = pd.DataFrame({'ID' : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
'Data' : [5, 7, 6, 13, 14, 11, 10, 2, 4, 3]})
some_list = [2, 3]
I want to group by ID column, and return an average of the Data column (df.groupby('ID').mean()
for most values) However, if ID is in some_list
then I would like the average to be calculated as the sum of Data divided by 4 (df.groupby('ID').sum()/4
). The output for the above would look as below:
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
I know I could do both methods separately and join into one column after doing the groupby, but I was wondering if its possible to do this in one step? Maybe with df.groupby('ID').apply(function)
?
I've looked at this question, but it didn't help me.
CodePudding user response:
Try groupby
with apply
and a condition:
df.groupby('ID', as_index=False)['Data'].apply(lambda x: x.sum() / 4 if x.name in some_list else x.mean())
Output:
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
CodePudding user response:
If performance is important dont use groupby.apply
, you can filter rows for aggregate sum
with division and for aggregate mean
:
s = df[df['ID'].isin(some_list)].groupby('ID')['Data'].sum().div(4)
df = s.combine_first(df.groupby('ID')['Data'].mean()).reset_index()
print (df)
ID Data
0 1 6.00
1 2 12.00
2 3 2.25
Here is alternative solution:
df = df.groupby('ID')['Data'].agg(['sum','mean']).reset_index()
df['Value'] = np.where(df['ID'].isin(some_list), df.pop('sum').div(4), df.pop('mean'))
print (df)
ID Value
0 1 6.00
1 2 12.00
2 3 2.25