Given a dataframe df
with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift value
column by one year use df['value'].shift(freq=pd.DateOffset(years=1))
:
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace orginal value by df['value'] = df['value'].shift(freq=pd.DateOffset(years=1))
, it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, so I think the issue caused by NaN
s in value
column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also try with df['value'].shift(freq=relativedelta(years= 1))
, but it generates: pandas.errors.NullFrequencyError: Cannot shift with no freq
Someone could help to deal with this issue? Sincere thanks.
CodePudding user response:
Since the code below works smoothly, so I think the issue caused by NaNs in value column
No I don't think so. It's probably because in your 2nd sample you have only 1 leap year.
Reproducible error with 2 leap years:
# 2018 (366 days), 2019 (365 days) and 2020 (366 days)
dates = pd.date_range('20180101', periods=365*3 1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3 1, 3)),
index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below works:
# 2017 (365 days), 2018 (366 days) and 2019 (365 days)
dates = pd.date_range('20170101', periods=365*3 1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3 1, 3)),
index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look to value_counts
:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
Solution
Obviously, the solution is to remove duplicated index, in our case '2021-02-28', by using resample('D')
and an aggregate function first
, last
, min
, max
, mean
, sum
or a custom one:
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
.resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note, you can replace .resample(...).func
by .loc[lambda x: x.index.duplicated()]