groupby and mean returning NaN-CodePudding

I am trying to use groupby to group by symbol and return the average of prior high volume days using pandas.

I create my data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})

Filter by high volume and change:

high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)]

Then I get the last days volume (this works):

high_volume_days['previous_high_volume_day'] = high_volume_days.groupby('symbol')['volume'].shift(1)

But when I try to calculate the average of all the days per symbol:

high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].mean()

I am getting NaNs:

         date symbol  change    volume  previous_high_volume_day  avg_volume_prior_days
0  2022-01-01    ABC      20  20000000                       NaN                    NaN
4  2022-01-05    AAA      50  40000000                       NaN                    NaN
5  2022-01-06    AAA     100  60000000                40000000.0                    NaN

What am I missing here?

Desired output:

       date symbol  change    volume  previous_high_volume_day  avg_volume_prior_days
0  2022-01-01    ABC      20  20000000                       NaN                    20000000
4  2022-01-05    AAA      50  40000000                       NaN                    40000000
5  2022-01-06    AAA     100  60000000                40000000.0                    50000000

CodePudding user response：

high_volume_days['avg_volume_prior_days'] = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean().droplevel(0)

high_volume_days

    date        symbol  change  volume  previous_high_volume_day    avg_volume_prior_days
0   2022-01-01  ABC     20      20000000    NaN                     20000000.0
4   2022-01-05  AAA     50      40000000    NaN                     40000000.0
5   2022-01-06  AAA     100     60000000    40000000.0              50000000.0

CodePudding user response：

Index misalignment: high_volume_days is indexed by integers. The df.groupby(...) is indexed by the symbol.

Use merge instead:

high_volume_days = pd.merge(
    high_volume_days,
    df.groupby("symbol")["volume"].mean().rename("avg_volume_prior_days"),
    left_on="symbol",
    right_index=True,
)

CodePudding user response：

df.groupby('symbol')['volume'].mean() returns:

symbol
AAA    33333500.0
ABC     6667700.0
Name: volume, dtype: float64

which is an aggregation of each group to a single value. Note that the groups (symbol) are the index of this series. When you try to assign it back to high_volume_days, there is an index misalignment.

Instead of an aggregation (.mean() is equivalent to .agg("mean")), you should use a transformation: .transform("mean").

==== EDIT ====

Instead of the mean for all values, you're looking for the mean "thus far". You can typically do that using .expanding().mean(), but since you're reassigning back to a column in high_volume_days, you need to either drop the level that contains the symbols, or use a lambda:

high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
# or
high_volume_days.groupby('symbol')['volume'].transform(lambda x: x.expanding().mean())