I am trying to use groupby to group by symbol and return the average of prior high volume days using pandas.
I create my data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
"symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
"change": [20, 1, 2, 3, 50, 100],
"volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
Filter by high volume and change:
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)]
Then I get the last days volume (this works):
high_volume_days['previous_high_volume_day'] = high_volume_days.groupby('symbol')['volume'].shift(1)
But when I try to calculate the average of all the days per symbol:
high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].mean()
I am getting NaNs:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN NaN
4 2022-01-05 AAA 50 40000000 NaN NaN
5 2022-01-06 AAA 100 60000000 40000000.0 NaN
What am I missing here?
Desired output:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000
4 2022-01-05 AAA 50 40000000 NaN 40000000
5 2022-01-06 AAA 100 60000000 40000000.0 50000000
CodePudding user response:
high_volume_days['avg_volume_prior_days'] = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean().droplevel(0)
high_volume_days
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000.0
4 2022-01-05 AAA 50 40000000 NaN 40000000.0
5 2022-01-06 AAA 100 60000000 40000000.0 50000000.0
CodePudding user response:
Index misalignment: high_volume_days
is indexed by integers. The df.groupby(...)
is indexed by the symbol.
Use merge
instead:
high_volume_days = pd.merge(
high_volume_days,
df.groupby("symbol")["volume"].mean().rename("avg_volume_prior_days"),
left_on="symbol",
right_index=True,
)
CodePudding user response:
df.groupby('symbol')['volume'].mean()
returns:
symbol
AAA 33333500.0
ABC 6667700.0
Name: volume, dtype: float64
which is an aggregation of each group to a single value. Note that the groups (symbol) are the index of this series. When you try to assign it back to high_volume_days
, there is an index misalignment.
Instead of an aggregation (.mean()
is equivalent to .agg("mean")
), you should use a transformation: .transform("mean")
.
==== EDIT ====
Instead of the mean for all values, you're looking for the mean "thus far". You can typically do that using .expanding().mean()
, but since you're reassigning back to a column in high_volume_days
, you need to either drop the level that contains the symbols, or use a lambda:
high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
# or
high_volume_days.groupby('symbol')['volume'].transform(lambda x: x.expanding().mean())