I have a data frame with Boolean columns denoting holidays. I woudl like to add another Boolean column that flags the two days before any column and two days after, for any holiday column.
For example, take the data below:
import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range,
'peanutbutterday':peanutbutterday,
'jellyday':jellyday,
'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)
What I would expect is an additional column titled below as holiday_bookend
that looks like the following:
import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_bookend = [0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0]
holiday_dict = {'date':date_range,
'peanutbutterday':peanutbutterday,
'jellyday':jellyday,
'crackerday':crackerday,
'holiday_bookend':holiday_bookend}
df = pd.DataFrame.from_dict(holiday_dict)
I'm not sure if I should try with a loop. I haven't conceptually worked that out so I'm kind of stuck.
I tried to incorporate the suggestion from here: How To Identify days before and after a holiday within pandas? but it seemed I needed to put a column for each holiday. I need one column that takes into account all holiday columns.
CodePudding user response:
basically add two extra columns:
- detect when a holiday has occurred (use
any
method). - two days before and two days after (use
shift
method).
The columns work like this:
- The
any
method contains all the holiday days. - The
shift
method has -2 and 2 for 2 day shifting.
side note:
avoid using for loops inside a pandas dataframe. the vectorised methods will always be faster and preferable.
So you can do this:
import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range,
'peanutbutterday':peanutbutterday,
'jellyday':jellyday,
'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)
# add extra colums
df["holiday"] = df[["peanutbutterday", "jellyday", "crackerday"]].any(axis=1).astype(bool)
# column with 2 days before and 2 days after
df["holiday_extended"] = df["holiday"] | df["holiday"].shift(-2) | df["holiday"].shift(2)
which returns this:
date peanutbutterday jellyday crackerday holiday holiday_extended
0 2020-01-11 0 0 0 False False
1 2020-01-12 0 0 0 False False
2 2020-01-13 0 0 0 False False
3 2020-01-14 0 0 0 False False
4 2020-01-15 0 0 0 False False
5 2020-01-16 0 0 0 False True
6 2020-01-17 0 0 0 False True
7 2020-01-18 1 0 0 True True
8 2020-01-19 1 0 0 True True
9 2020-01-20 0 0 0 False True
10 2020-01-21 0 0 0 False True
11 2020-01-22 0 0 0 False True
12 2020-01-23 0 0 0 False True
13 2020-01-24 0 1 1 True True
14 2020-01-25 0 1 1 True True
15 2020-01-26 0 0 0 False True
16 2020-01-27 0 0 0 False True
17 2020-01-28 0 0 0 False False
18 2020-01-29 0 0 0 False False
19 2020-01-30 0 0 0 False False
20 2020-01-31 0 0 0 False False
21 2020-02-01 0 0 0 False False
22 2020-02-02 0 0 0 False False
23 2020-02-03 0 0 0 False False
24 2020-02-04 0 0 0 False False
25 2020-02-05 0 0 0 False False
26 2020-02-06 0 0 0 False False
27 2020-02-07 0 0 0 False False
28 2020-02-08 0 0 0 False False
29 2020-02-09 0 0 0 False False
30 2020-02-10 0 0 0 False False
31 2020-02-11 0 0 0 False True
32 2020-02-12 0 0 0 False True
33 2020-02-13 0 0 1 True True
34 2020-02-14 0 0 1 True True
35 2020-02-15 0 0 1 True True
36 2020-02-16 0 0 1 True True
37 2020-02-17 0 0 0 False True
38 2020-02-18 0 0 0 False True
39 2020-02-19 0 0 0 False False
40 2020-02-20 0 0 0 False False
41 2020-02-21 0 0 0 False False
42 2020-02-22 0 0 0 False False
43 2020-02-23 0 0 0 False False
44 2020-02-24 0 0 0 False False
CodePudding user response:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range,
'peanutbutterday':peanutbutterday,
'jellyday':jellyday,
'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)
# Grab all the holidays
holidays = df.loc[df[df.columns[1:]].sum(axis = 1) > 0, 'date'].values
# Subtract every day by every holiday and get the absolute time difference in days
days_from_holiday = np.subtract.outer(df.date.values, holidays)
days_from_holiday = np.min(np.abs(days_from_holiday), axis = 1)
days_from_holiday = np.array(days_from_holiday, dtype = 'timedelta64[D]')
# Make comparison
df['near_holiday'] = days_from_holiday <= np.timedelta64(2, 'D')
# If you want it to read 0 or 1
df['near_holiday'] = df['near_holiday'].astype('int')
print(df)
First, we need to grab all the holidays. If we sum across all the holiday columns, then any rows with a sum > 0 is a holiday and we pull that date.
Then, we subtract every day by every holiday, which is quickly done using np.subtract.outer. Then we find the minimum of the absolute value to see the closest time to a holiday a date has. Then we just convert it to days because the default unit is nanoseconds. After that, it's just a matter of making the comparison and assigning it to the column.