I got a question about how to label specific days and other days close to it. I have two pandas dataframes, one with Canadian holidays like the one below:
ds holiday
2019-12-25 Christmas Day
2019-12-26 Boxing Day
2020-01-01 New Year's Day
2020-02-17 Family Day
2020-04-10 Good Friday
and another data frame like the one below:
ds y
2020-01-01 5
2020-01-02 941
2020-01-03 633
2020-01-04 633
2020-01-05 633
2020-01-06 633
. .
. .
. .
I want to create a new column in my second dataframe, in which I label as 1 the days in common and also the 3 days before and after the dates in common. In poor words, I want to label all the holidays in my df and also the 3 days before and after the holidays.
I would like something like this as output:
ds y label
2020-01-01 5 1
2020-01-02 941 1
2020-01-03 633 1
2020-01-04 633 1
2020-01-05 633 0
2020-01-06 633 0
. . .
. . .
. . .
I was wondering how I can do it.
Thanks for your help and have a great day!
CodePudding user response:
Here is a sample solution:
import pandas as pd
import numpy as np
import datetime
df1 = pd.DataFrame({'ds': ['2019-12-25', '2019-12-26', '2020-01-01', '2020-02-17', '2020-04-10'],
'holiday': ["Christmas Day", 'Boxing Day' ,"New Year's Day" ,"Family Day" ,"Good Friday"]} )
df2 = pd.DataFrame({'ds': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06'],
'y': [5, 941, 633, 633, 633, 633]})
df1['ds'] = df1['ds'].astype('datetime64')
df2['ds'] = df2['ds'].astype('datetime64')
#convert to dic for fast look up - 3 days
dic = {}
for i in range(df1.shape[0]):
dic[df1['ds'][i]] = 1
dic[df1['ds'][i] datetime.timedelta(days=-3)] = 1
dic[df1['ds'][i] datetime.timedelta(days=-2)] = 1
dic[df1['ds'][i] datetime.timedelta(days=-1)] = 1
dic[df1['ds'][i] datetime.timedelta(days=1)] = 1
dic[df1['ds'][i] datetime.timedelta(days=2)] = 1
dic[df1['ds'][i] datetime.timedelta(days=3)] = 1
df2['label'] = np.zeros(df2.shape[0], dtype=np.int32)
for i in range(df2.shape[0]):
if df2['ds'][i] in dic:
df2['label'][i] = 1
CodePudding user response:
This worked fine for me
def get_holiday_range(df, day_range):
df['holiday_range'] = 0
for holiday in holidays['ds'].values :
holiday_range = np.arange(holiday - np.timedelta64(day_range,'D'), holiday np.timedelta64(day_range,'D'), np.timedelta64(1, "D"))
df.loc[df.call_datetime_date.isin(holiday_range), 'holiday_range'] = 1
return df