Home > Software design >  calculating expanding mean in pandas with date multiindex
calculating expanding mean in pandas with date multiindex

Time:11-11

I have a dataframe with a multi-index the first level is a stock ticker the second is a date such as this:

import pandas as pd
import numpy as np

arrays = [
    np.array(["MSFT", "MSFT", "MSFT", "MSFT", "GOOG", "GOOG", "GOOG", "GOOG"]),
    np.array([pd.to_datetime("2022-04-05"), pd.to_datetime("2022-04-06"), pd.to_datetime("2022-04-07"), pd.to_datetime("2022-04-08"),pd.to_datetime("2022-04-05"), pd.to_datetime("2022-04-06"), pd.to_datetime("2022-04-07"), pd.to_datetime("2022-04-08")]),
]


df = pd.DataFrame(np.random.randn(8, 1), index=arrays, columns={"P/Sales"})
df
Ticker Date P/Sales
MSFT 2022-04-05 00:00:00 -1.1648041378350011
MSFT 2022-04-06 00:00:00 -2.352493882690841
MSFT 2022-04-07 00:00:00 0.6047960443400636
MSFT 2022-04-08 00:00:00 1.533852852883288
GOOG 2022-04-05 00:00:00 2.0738583230363195
GOOG 2022-04-06 00:00:00 -1.887463559617715
GOOG 2022-04-07 00:00:00 0.49868465858420324
GOOG 2022-04-08 00:00:00 0.900527823711884

I want to add another column which would contain the mean of P/Sales for all dates BEFORE the current line for that stock ticker.

Any help would be much appreciated.

Currently I am achieving this with a for loop but am thinking there must be a better way. Perhaps expanding.mean() but I can't seem to get the syntax right to do the mean only on prior dates for that ticker.

for ticker in  df.reset_index()['Ticker'].unique():    
    for date in df.loc[ticker].index:
      filtered_df = df.loc[ticker].loc[(df.loc[ticker].index < date)]
      mean = np.mean(np.asarray(filtered_df['P/Sales']))
      df.loc[(ticker,date),"Mean"] = mean

df
Ticker Date P/Sales Mean
MSFT 2022-04-05 00:00:00 -1.4361186774548342 NaN
MSFT 2022-04-06 00:00:00 0.5922672022039192 -1.4361186774548342
MSFT 2022-04-07 00:00:00 0.4844687977324626 -0.42192573762545754
MSFT 2022-04-08 00:00:00 -0.4125514467122677 -0.11979422583948417
GOOG 2022-04-05 00:00:00 -0.30860020353979284 NaN
GOOG 2022-04-06 00:00:00 0.1189685515177034 -0.30860020353979284
GOOG 2022-04-07 00:00:00 1.1730395696068878 -0.09481582601104471
GOOG 2022-04-08 00:00:00 -0.9778323004522275 0.3278026391949328

CodePudding user response:

Use GroupBy.transform with lambda function with Expanding.mean and Series.shift:

for ticker in  df.reset_index()['Ticker'].unique():    
    for date in df.loc[ticker].index:
      filtered_df = df.loc[ticker].loc[(df.loc[ticker].index < date)]
      print (filtered_df)
      mean = np.mean(np.asarray(filtered_df['P/Sales']))
      df.loc[(ticker,date),"Mean"] = mean


df['Mean1'] = (df.groupby('Ticker')['P/Sales']
                 .transform(lambda x: x.expanding().mean().shift()))

print (df)
                    P/Sales      Mean     Mean1
Ticker Date                                    
MSFT   2022-04-05 -1.164804       NaN       NaN
       2022-04-06 -2.352494 -1.164804 -1.164804
       2022-04-07  0.604796 -1.758649 -1.758649
       2022-04-08  1.533853 -0.970834 -0.970834
GOOG   2022-04-05  2.073858       NaN       NaN
       2022-04-06 -1.887464  2.073858  2.073858
       2022-04-07  0.498685  0.093197  0.093197
       2022-04-08  0.900528  0.228360  0.228360
  • Related