Home > Software design >  How do I create a Boolean column that places a flag on two days before and two days after a holiday?
How do I create a Boolean column that places a flag on two days before and two days after a holiday?

Time:01-18

I have a data frame with Boolean columns denoting holidays. I woudl like to add another Boolean column that flags the two days before any column and two days after, for any holiday column.

For example, take the data below:

    import pandas as pd
    from pandas.tseries.offsets import DateOffset
    date_range = pd.date_range(start = pd.to_datetime("2020-01-10")   DateOffset(days=1), periods = 45, freq = 'D').to_list()
    peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    jellyday =  [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]

    holiday_dict = {'date':date_range, 
                'peanutbutterday':peanutbutterday,
                'jellyday':jellyday,
                'crackerday':crackerday}
    df = pd.DataFrame.from_dict(holiday_dict) 

What I would expect is an additional column titled below as holiday_bookend that looks like the following:

    import pandas as pd
    from pandas.tseries.offsets import DateOffset
    date_range = pd.date_range(start = pd.to_datetime("2020-01-10")   DateOffset(days=1), periods = 45, freq = 'D').to_list()
    peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    jellyday =  [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
    holiday_bookend = [0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0]

    holiday_dict = {'date':date_range, 
                'peanutbutterday':peanutbutterday,
                'jellyday':jellyday,
                'crackerday':crackerday,
               'holiday_bookend':holiday_bookend}
    df = pd.DataFrame.from_dict(holiday_dict)

I'm not sure if I should try with a loop. I haven't conceptually worked that out so I'm kind of stuck.

I tried to incorporate the suggestion from here: How To Identify days before and after a holiday within pandas? but it seemed I needed to put a column for each holiday. I need one column that takes into account all holiday columns.

CodePudding user response:

basically add two extra columns:

  • detect when a holiday has occurred (use any method).
  • two days before and two days after (use shift method).

The columns work like this:

  • The any method contains all the holiday days.
  • The shift method has -2 and 2 for 2 day shifting.

side note:

avoid using for loops inside a pandas dataframe. the vectorised methods will always be faster and preferable.

So you can do this:

import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10")   DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday =  [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]


holiday_dict = {'date':date_range, 
            'peanutbutterday':peanutbutterday,
            'jellyday':jellyday,
            'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

# add extra colums
df["holiday"] = df[["peanutbutterday", "jellyday", "crackerday"]].any(axis=1).astype(bool)

# column with 2 days before and 2 days after
df["holiday_extended"] = df["holiday"] | df["holiday"].shift(-2) | df["holiday"].shift(2)

which returns this:

    date    peanutbutterday jellyday    crackerday  holiday holiday_extended
0   2020-01-11  0   0   0   False   False
1   2020-01-12  0   0   0   False   False
2   2020-01-13  0   0   0   False   False
3   2020-01-14  0   0   0   False   False
4   2020-01-15  0   0   0   False   False
5   2020-01-16  0   0   0   False   True
6   2020-01-17  0   0   0   False   True
7   2020-01-18  1   0   0   True    True
8   2020-01-19  1   0   0   True    True
9   2020-01-20  0   0   0   False   True
10  2020-01-21  0   0   0   False   True
11  2020-01-22  0   0   0   False   True
12  2020-01-23  0   0   0   False   True
13  2020-01-24  0   1   1   True    True
14  2020-01-25  0   1   1   True    True
15  2020-01-26  0   0   0   False   True
16  2020-01-27  0   0   0   False   True
17  2020-01-28  0   0   0   False   False
18  2020-01-29  0   0   0   False   False
19  2020-01-30  0   0   0   False   False
20  2020-01-31  0   0   0   False   False
21  2020-02-01  0   0   0   False   False
22  2020-02-02  0   0   0   False   False
23  2020-02-03  0   0   0   False   False
24  2020-02-04  0   0   0   False   False
25  2020-02-05  0   0   0   False   False
26  2020-02-06  0   0   0   False   False
27  2020-02-07  0   0   0   False   False
28  2020-02-08  0   0   0   False   False
29  2020-02-09  0   0   0   False   False
30  2020-02-10  0   0   0   False   False
31  2020-02-11  0   0   0   False   True
32  2020-02-12  0   0   0   False   True
33  2020-02-13  0   0   1   True    True
34  2020-02-14  0   0   1   True    True
35  2020-02-15  0   0   1   True    True
36  2020-02-16  0   0   1   True    True
37  2020-02-17  0   0   0   False   True
38  2020-02-18  0   0   0   False   True
39  2020-02-19  0   0   0   False   False
40  2020-02-20  0   0   0   False   False
41  2020-02-21  0   0   0   False   False
42  2020-02-22  0   0   0   False   False
43  2020-02-23  0   0   0   False   False
44  2020-02-24  0   0   0   False   False

CodePudding user response:

import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10")   DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday =  [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]

holiday_dict = {'date':date_range, 
            'peanutbutterday':peanutbutterday,
            'jellyday':jellyday,
            'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

# Grab all the holidays
holidays = df.loc[df[df.columns[1:]].sum(axis = 1) > 0, 'date'].values

# Subtract every day by every holiday and get the absolute time difference in days

days_from_holiday = np.subtract.outer(df.date.values, holidays)
days_from_holiday = np.min(np.abs(days_from_holiday), axis = 1)
days_from_holiday = np.array(days_from_holiday, dtype = 'timedelta64[D]')

# Make comparison

df['near_holiday'] = days_from_holiday <= np.timedelta64(2, 'D')

# If you want it to read 0 or 1
df['near_holiday'] = df['near_holiday'].astype('int')

print(df)

First, we need to grab all the holidays. If we sum across all the holiday columns, then any rows with a sum > 0 is a holiday and we pull that date.

Then, we subtract every day by every holiday, which is quickly done using np.subtract.outer. Then we find the minimum of the absolute value to see the closest time to a holiday a date has. Then we just convert it to days because the default unit is nanoseconds. After that, it's just a matter of making the comparison and assigning it to the column.

  • Related