Home > Software engineering >  Calculate monthly percentage change of daily basis data in Python
Calculate monthly percentage change of daily basis data in Python

Time:11-10

Let's say I have a daily data as follows:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=list('ABC'))
df

Out:

             A   B   C
2013-02-26  85  57   0
2013-02-27  94  86  44
2013-02-28  62  91  29
2013-03-01  21  93  24
2013-03-02  12  70  70
        ..  ..  ..
2013-05-22  57  13  81
2013-05-23  43  68  85
2013-05-24  55  50  53
2013-05-25  75  78  66
2013-05-26  70  93   3

For column A and B, I need to calculate their monthly pct change on daily basis, for example, the monthly pct change value of A for 2013-05-26 will be calculated by: A's value in 2013-05-26 divided by the value in 2013-04-26 minus 1.

My idea is like this: create new columns 'A1', 'B1' by shifting them one month forward, then df['A_MoM'] will be calculated by df['A']/df['A_shifted'] - 1, same logic for column B.

Since not all the months share same length of days, so I will use last day's value of last months, ie., to calculate 2013-03-30's pct change will be calculated by: 2013-03-30's value/2013-02-28's value - 1.

I tried the code below, but it generates a dataframe with all NaNs:

df[['A1', 'B1']] = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().fillna(method=ffill)
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df[['A1', 'B1']], axis=0) - 1

Out:

             A  A1   B  B1
2013-02-26 NaN NaN NaN NaN
2013-02-27 NaN NaN NaN NaN
2013-02-28 NaN NaN NaN NaN
2013-03-01 NaN NaN NaN NaN
2013-03-02 NaN NaN NaN NaN
        ..  ..  ..  ..
2013-05-22 NaN NaN NaN NaN
2013-05-23 NaN NaN NaN NaN
2013-05-24 NaN NaN NaN NaN
2013-05-25 NaN NaN NaN NaN
2013-05-26 NaN NaN NaN NaN

How could achieve that correctly? Sincere thanks at advance.

Edit:

df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=['A_values', 'B_values', 'C'])
df.columns

df1 = df.filter(regex='_values$').shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill().add_suffix('_shifted')
df2 = df.filter(regex='_values$').div(df1.to_numpy(), axis=0) - 1
df.join(df2.add_suffix('_MoM'))

Out:

ValueError: Unable to coerce to DataFrame, shape must be (90, 2): given (93, 2)

CodePudding user response:

Reason is different columns names, solution is converting df[['A1', 'B1']] to numpy array:

df[['A1', 'B1']] = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df[['A1', 'B1']].to_numpy(), axis=0) - 1
print (df)
             A   B   C    A1    B1      A_MoM     B_MoM
2013-02-26  85  57   0   NaN   NaN        NaN       NaN
2013-02-27  94  86  44   NaN   NaN        NaN       NaN
2013-02-28  62  91  29   NaN   NaN        NaN       NaN
2013-03-01  21  93  24   NaN   NaN        NaN       NaN
2013-03-02  12  70  70   NaN   NaN        NaN       NaN
        ..  ..  ..   ...   ...        ...       ...
2013-05-22  57  13  81  14.0  50.0   3.071429 -0.740000
2013-05-23  43  68  85   2.0  45.0  20.500000  0.511111
2013-05-24  55  50  53  89.0  52.0  -0.382022 -0.038462
2013-05-25  75  78  66  86.0  54.0  -0.127907  0.444444
2013-05-26  70  93   3   4.0  45.0  16.500000  1.066667

[90 rows x 7 columns]

Or if possible assign output to df1, so columns names are not changed, so possible divide with same columns names, here A, B correctly:

df1 = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df1, axis=0) - 1
print (df)
             A   B   C      A_MoM     B_MoM
2013-02-26  85  57   0        NaN       NaN
2013-02-27  94  86  44        NaN       NaN
2013-02-28  62  91  29        NaN       NaN
2013-03-01  21  93  24        NaN       NaN
2013-03-02  12  70  70        NaN       NaN
        ..  ..  ..        ...       ...
2013-05-22  57  13  81   3.071429 -0.740000
2013-05-23  43  68  85  20.500000  0.511111
2013-05-24  55  50  53  -0.382022 -0.038462
2013-05-25  75  78  66  -0.127907  0.444444
2013-05-26  70  93   3  16.500000  1.066667

[90 rows x 5 columns]

EDIT: After resample is changed also datetimeIndex, so added reindex for same indices in both DataFrames:

np.random.seed(2021)

dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=['A_values', 'B_values', 'C'])

df1 = df.filter(regex='_values$').shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
print (df1.columns)
Index(['A_values', 'B_values'], dtype='object')

df2 = df.filter(regex='_values$').div(df1, axis=0).sub(1).reindex(df.index)

print (df.filter(regex='_values$').columns)
Index(['A_values', 'B_values'], dtype='object')

df = df.join(df2.add_suffix('MoM'))
print (df)
            A_values  B_values   C  A_valuesMoM  B_valuesMoM
2013-02-26        85        57   0          NaN          NaN
2013-02-27        94        86  44          NaN          NaN
2013-02-28        62        91  29          NaN          NaN
2013-03-01        21        93  24          NaN          NaN
2013-03-02        12        70  70          NaN          NaN
             ...       ...  ..          ...          ...
2013-05-22        57        13  81     3.071429    -0.740000
2013-05-23        43        68  85    20.500000     0.511111
2013-05-24        55        50  53    -0.382022    -0.038462
2013-05-25        75        78  66    -0.127907     0.444444
2013-05-26        70        93   3    16.500000     1.066667

[90 rows x 5 columns]
  • Related