Let's say I have a daily data as follows:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=list('ABC'))
df
Out:
A B C
2013-02-26 85 57 0
2013-02-27 94 86 44
2013-02-28 62 91 29
2013-03-01 21 93 24
2013-03-02 12 70 70
.. .. ..
2013-05-22 57 13 81
2013-05-23 43 68 85
2013-05-24 55 50 53
2013-05-25 75 78 66
2013-05-26 70 93 3
For column A
and B
, I need to calculate their monthly pct change on daily basis, for example, the monthly pct change value of A
for 2013-05-26
will be calculated by: A's value in 2013-05-26 divided by the value in 2013-04-26 minus 1.
My idea is like this: create new columns 'A1', 'B1'
by shifting them one month forward, then df['A_MoM']
will be calculated by df['A']/df['A_shifted'] - 1
, same logic for column B
.
Since not all the months share same length of days, so I will use last day's value of last months, ie., to calculate 2013-03-30's pct change will be calculated by: 2013-03-30's value/2013-02-28's value - 1
.
I tried the code below, but it generates a dataframe with all NaN
s:
df[['A1', 'B1']] = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().fillna(method=ffill)
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df[['A1', 'B1']], axis=0) - 1
Out:
A A1 B B1
2013-02-26 NaN NaN NaN NaN
2013-02-27 NaN NaN NaN NaN
2013-02-28 NaN NaN NaN NaN
2013-03-01 NaN NaN NaN NaN
2013-03-02 NaN NaN NaN NaN
.. .. .. ..
2013-05-22 NaN NaN NaN NaN
2013-05-23 NaN NaN NaN NaN
2013-05-24 NaN NaN NaN NaN
2013-05-25 NaN NaN NaN NaN
2013-05-26 NaN NaN NaN NaN
How could achieve that correctly? Sincere thanks at advance.
Edit:
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=['A_values', 'B_values', 'C'])
df.columns
df1 = df.filter(regex='_values$').shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill().add_suffix('_shifted')
df2 = df.filter(regex='_values$').div(df1.to_numpy(), axis=0) - 1
df.join(df2.add_suffix('_MoM'))
Out:
ValueError: Unable to coerce to DataFrame, shape must be (90, 2): given (93, 2)
CodePudding user response:
Reason is different columns names, solution is converting df[['A1', 'B1']]
to numpy array:
df[['A1', 'B1']] = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df[['A1', 'B1']].to_numpy(), axis=0) - 1
print (df)
A B C A1 B1 A_MoM B_MoM
2013-02-26 85 57 0 NaN NaN NaN NaN
2013-02-27 94 86 44 NaN NaN NaN NaN
2013-02-28 62 91 29 NaN NaN NaN NaN
2013-03-01 21 93 24 NaN NaN NaN NaN
2013-03-02 12 70 70 NaN NaN NaN NaN
.. .. .. ... ... ... ...
2013-05-22 57 13 81 14.0 50.0 3.071429 -0.740000
2013-05-23 43 68 85 2.0 45.0 20.500000 0.511111
2013-05-24 55 50 53 89.0 52.0 -0.382022 -0.038462
2013-05-25 75 78 66 86.0 54.0 -0.127907 0.444444
2013-05-26 70 93 3 4.0 45.0 16.500000 1.066667
[90 rows x 7 columns]
Or if possible assign output to df1
, so columns names are not changed, so possible divide with same columns names, here A, B
correctly:
df1 = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df1, axis=0) - 1
print (df)
A B C A_MoM B_MoM
2013-02-26 85 57 0 NaN NaN
2013-02-27 94 86 44 NaN NaN
2013-02-28 62 91 29 NaN NaN
2013-03-01 21 93 24 NaN NaN
2013-03-02 12 70 70 NaN NaN
.. .. .. ... ...
2013-05-22 57 13 81 3.071429 -0.740000
2013-05-23 43 68 85 20.500000 0.511111
2013-05-24 55 50 53 -0.382022 -0.038462
2013-05-25 75 78 66 -0.127907 0.444444
2013-05-26 70 93 3 16.500000 1.066667
[90 rows x 5 columns]
EDIT: After resample
is changed also datetimeIndex, so added reindex
for same indices in both DataFrames
:
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=['A_values', 'B_values', 'C'])
df1 = df.filter(regex='_values$').shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
print (df1.columns)
Index(['A_values', 'B_values'], dtype='object')
df2 = df.filter(regex='_values$').div(df1, axis=0).sub(1).reindex(df.index)
print (df.filter(regex='_values$').columns)
Index(['A_values', 'B_values'], dtype='object')
df = df.join(df2.add_suffix('MoM'))
print (df)
A_values B_values C A_valuesMoM B_valuesMoM
2013-02-26 85 57 0 NaN NaN
2013-02-27 94 86 44 NaN NaN
2013-02-28 62 91 29 NaN NaN
2013-03-01 21 93 24 NaN NaN
2013-03-02 12 70 70 NaN NaN
... ... .. ... ...
2013-05-22 57 13 81 3.071429 -0.740000
2013-05-23 43 68 85 20.500000 0.511111
2013-05-24 55 50 53 -0.382022 -0.038462
2013-05-25 75 78 66 -0.127907 0.444444
2013-05-26 70 93 3 16.500000 1.066667
[90 rows x 5 columns]