Update pandas dataframe column based on date column via list of datetimes

Time:08-09

Old question

Please refer to the above question for details. I need to add 0.5 business days to the business_days column for every holiday in the second list that is not in the first. Here is an example input df called predicted_df:

PredictionTargetDateEOM business_days
0       2022-06-30      22
1       2022-06-30      22
2       2022-06-30      22
3       2022-06-30      22
4       2022-06-30      22
        ... ... ...
172422  2022-11-30      21
172423  2022-11-30      21
172424  2022-11-30      21
172425  2022-11-30      21
172426  2022-11-30      21

The PredictionTargetDateEOM is just the last day of the month. business_days is the number of business days in that month, and should be the same for all rows within that month. Here are two lists of holidays. For each holiday that is present in the second list but not the first, the business_days column should have 0.5 added to it in every row of the dataframe whose month contains that holiday.

rocket_holiday = ["New Year's Day", "Martin Luther King Jr. Day", "Memorial Day", "Independence Day",
                 "Labor Day", "Thanksgiving", "Christmas Day"]
rocket_holiday_including_observed = rocket_holiday + [item + ' (Observed)' for item in rocket_holiday]
print(rocket_holiday_including_observed)
["New Year's Day",
 'Martin Luther King Jr. Day',
 'Memorial Day',
 'Independence Day',
 'Labor Day',
 'Thanksgiving',
 'Christmas Day',
 "New Year's Day (Observed)",
 'Martin Luther King Jr. Day (Observed)',
 'Memorial Day (Observed)',
 'Independence Day (Observed)',
 'Labor Day (Observed)',
 'Thanksgiving (Observed)',
 'Christmas Day (Observed)']
banker_hols = [i for i in holidays.US(years = 2022).values()]
print(banker_hols)
2022-01-01 New Year's Day
2022-01-17 Martin Luther King Jr. Day
2022-02-21 Washington's Birthday
2022-05-30 Memorial Day
2022-06-19 Juneteenth National Independence Day
2022-06-20 Juneteenth National Independence Day (Observed)
2022-07-04 Independence Day
2022-09-05 Labor Day
2022-10-10 Columbus Day
2022-11-11 Veterans Day
2022-11-24 Thanksgiving
2022-12-25 Christmas Day
2022-12-26 Christmas Day (Observed)

The second list is actually derived from a dictionary via:

import holidays
# Keys are datetime.date objects, values are holiday names
for date, name in holidays.US(years=2022).items():
    print(date, name)

Which in raw looks like this:

{datetime.date(2022, 1, 1): "New Year's Day", datetime.date(2022, 1, 17): 'Martin Luther King Jr. Day', datetime.date(2022, 2, 21): "Washington's Birthday", datetime.date(2022, 5, 30): 'Memorial Day', datetime.date(2022, 6, 19): 'Juneteenth National Independence Day', datetime.date(2022, 6, 20): 'Juneteenth National Independence Day (Observed)', datetime.date(2022, 7, 4): 'Independence Day', datetime.date(2022, 9, 5): 'Labor Day', datetime.date(2022, 10, 10): 'Columbus Day', datetime.date(2022, 11, 11): 'Veterans Day', datetime.date(2022, 11, 24): 'Thanksgiving', datetime.date(2022, 12, 25): 'Christmas Day', datetime.date(2022, 12, 26): 'Christmas Day (Observed)'}

The following is an example output to show the desired outcome:

PredictionTargetDateEOM business_days
0       2022-06-30      22.5
1       2022-06-30      22.5
2       2022-06-30      22.5
3       2022-06-30      22.5
4       2022-06-30      22.5
        ... ... ...
172422  2022-11-30      21.5
172423  2022-11-30      21.5
172424  2022-11-30      21.5
172425  2022-11-30      21.5
172426  2022-11-30      21.5

As you can see, since Juneteenth and Veterans Day are in the second list, but not the first, I would add 0.5 days to the 'business_days' column for each row that contains June and November as the month. However, for other months like July or January where the holidays are shared between the two lists, the business_days column for those months should be unchanged. Lastly, this method should be robust for backfilling historical data from previous years as well. I have tried the following method but it does not perform as needed. It will either remove entire months from the dataframe, or for the months that it doesn't remove, not alter the business_days elements for the months I need it to.

main_list = list(set(banker_hols) - set(rocket_holiday_including_observed))
print(main_list)

['Columbus Day',
 'Juneteenth National Independence Day',
 "Washington's Birthday",
 'Juneteenth National Independence Day (Observed)',
 'Veterans Day']

result = []
for key, value in holidays.US(years = 2022).items():
    if value in main_list:
        result.append(key)
print(result)

[datetime.date(2022, 2, 21),
 datetime.date(2022, 6, 19),
 datetime.date(2022, 6, 20),
 datetime.date(2022, 10, 10),
 datetime.date(2022, 11, 11)]

So I have the months I need to add 0.5 business days to, but I'm not sure how to update the business_days column in the dataframe for all of the rows that fall into those months.

EDIT problem solved here: Add quantity to pandas column if row condition is met

My answer, which uses the .loc indexer shown in the linked question:

#Identify holidays in banker list not in rocket list
banker_hols = [i for i in holidays.US(years = 2022).values()]
hol_diffs = list(set(banker_hols) - set(rocket_holiday_including_observed))

#Extract dates of those holidays
dates_of_hols = []
for key, value in holidays.US(years = 2022).items():
    if value in hol_diffs:
        dates_of_hols.append(key)

#Extract just the months of those holidays
months = []
for item in dates_of_hols:
    months.append(item.month)
months = list(set(months))

#Add 0.5 to business_days for those months
predicted_df.loc[predicted_df['PredictionTargetDateEOM'].dt.month.isin(months), 'business_days'] += 0.5
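
For anyone who wants to try the .loc step in isolation, here is a minimal sketch on a toy frame. The four rows and their business_days values are made up for illustration, and the holiday dates are copied from the printed result above rather than pulled from the holidays package. It also assumes the column should be cast to float before the half day is added, since the sample data shows integers.

```python
import datetime
import pandas as pd

# Toy stand-in for predicted_df (illustrative values only).
predicted_df = pd.DataFrame({
    "PredictionTargetDateEOM": pd.to_datetime(
        ["2022-06-30", "2022-06-30", "2022-07-31", "2022-11-30"]
    ),
    "business_days": [22, 22, 20, 21],
})

# Dates of the holidays found only in the banker list (copied from the output above).
dates_of_hols = [
    datetime.date(2022, 2, 21),
    datetime.date(2022, 6, 19),
    datetime.date(2022, 6, 20),
    datetime.date(2022, 10, 10),
    datetime.date(2022, 11, 11),
]

# Unique month numbers that contain at least one of those holidays.
months = sorted({d.month for d in dates_of_hols})

# Cast to float first so the half day is not written into an integer column.
predicted_df["business_days"] = predicted_df["business_days"].astype(float)

# Boolean mask of rows whose month is in the list, then an in-place increment.
mask = predicted_df["PredictionTargetDateEOM"].dt.month.isin(months)
predicted_df.loc[mask, "business_days"] += 0.5
```

Two things to keep in mind with this approach: each flagged month gets 0.5 exactly once, even June, which has two entries in dates_of_hols (this matches the desired output in the question), and the mask looks only at the month number, so a frame spanning several years would have the same months adjusted in every year unless the year is compared as well.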

CodePudding user response:

We only need the dates of the relevant holidays:

relevant_holidays = {
    x: y for x, y in holidays.US(years=2022).items() 
    if y not in rocket_holiday_including_observed
}

We get the corresponding month-end date using pandas magic:

holiday_month_end = pd.to_datetime(
    list(relevant_holidays.keys())
).to_period("M").to_timestamp("M")
DatetimeIndex(['2022-02-28', '2022-06-30', '2022-06-30', '2022-10-31',
               '2022-11-30'],
              dtype='datetime64[ns]', freq=None)
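
If the period round trip feels opaque, a possible alternative (not what the code above uses) is to roll each date forward with pd.offsets.MonthEnd(0). The sketch below assumes the five holiday dates from the question, hard-coded so that it runs without the holidays package, and also previews the per-month count used in the next step.

```python
import datetime
import pandas as pd

# Dates of the holidays that appear only in the banker list (from the question).
relevant_dates = [
    datetime.date(2022, 2, 21),
    datetime.date(2022, 6, 19),
    datetime.date(2022, 6, 20),
    datetime.date(2022, 10, 10),
    datetime.date(2022, 11, 11),
]

# MonthEnd(0) rolls each date forward to the last day of its month and
# leaves dates that already fall on a month end unchanged.
holiday_month_end = pd.to_datetime(relevant_dates) + pd.offsets.MonthEnd(0)

# Half a day per holiday, summed per month end.
to_add = holiday_month_end.value_counts() * 0.5
```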

Before joining, we count them for each month and multiply by 0.5:

to_add = holiday_month_end.value_counts() * 0.5
2022-06-30    1.0
2022-02-28    0.5
2022-10-31    0.5
2022-11-30    0.5
dtype: float64

The index is now unique. To align it to the dataframe, use reindex:

predicted_df["business_days"] = predicted_df["business_days"] + to_add.reindex(
    pd.to_datetime(predicted_df["PredictionTargetDateEOM"])
).fillna(0).values

The fillna is necessary because to_add does not have an entry for every month. The .values is necessary to drop the reindexed date index; otherwise pandas would try to align the two Series on their index labels instead of adding them row by row in order.
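
To see the alignment step on something small, here is a sketch with a hand-built to_add and a four-row toy frame (both are illustrative assumptions, not the real data). It runs the reindex approach next to Series.map, which is one possible alternative that keeps the dataframe's own index and therefore does not need .values.

```python
import pandas as pd

# Toy stand-in for predicted_df (illustrative values only).
predicted_df = pd.DataFrame({
    "PredictionTargetDateEOM": pd.to_datetime(
        ["2022-06-30", "2022-06-30", "2022-07-31", "2022-11-30"]
    ),
    "business_days": [22.0, 22.0, 20.0, 21.0],
})

# Half days to add per month end, as produced by value_counts() * 0.5.
to_add = pd.Series({
    pd.Timestamp("2022-02-28"): 0.5,
    pd.Timestamp("2022-06-30"): 1.0,
    pd.Timestamp("2022-10-31"): 0.5,
    pd.Timestamp("2022-11-30"): 0.5,
})

dates = predicted_df["PredictionTargetDateEOM"]

# Reindex approach: the result is indexed by date, so .values drops that
# index and the addition happens positionally.
via_reindex = predicted_df["business_days"] + to_add.reindex(dates).fillna(0).values

# Map approach: look each date up in to_add; the result keeps the
# dataframe's index, so the addition aligns row by row on its own.
via_map = predicted_df["business_days"] + dates.map(to_add).fillna(0)
```

Both variants add a full day to the June rows, because to_add counts Juneteenth and its observed date separately. That differs from the 22.5 shown in the question's desired output, so it is worth deciding which behaviour is wanted.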

CodePudding user response:

Here is a more modular, Pythonic solution:

my_list = [
    "New Year's Day",
    "Martin Luther King Jr. Day",
    "Memorial Day",
    "Independence Day",
    "Labor Day",
    "Thanksgiving",
    "Christmas Day",
    "New Year's Day (Observed)",
    "Martin Luther King Jr. Day (Observed)",
    "Memorial Day (Observed)",
    "Independence Day (Observed)",
    "Labor Day (Observed)",
    "Thanksgiving (Observed)",
    "Christmas Day (Observed)",
]

# to speed up the search
my_set = set(my_list)

# Business days per row when every US holiday counts as a day off
predicted_df['business_days_bankers'] = predicted_df.apply(
    lambda x: np.busday_count(
        x['PredictionTargetDateBOM'].date(),
        x['DayAfterTargetDateEOM'].date(),
        holidays=[k for k, v in holidays.US(years=x['PredictionTargetDateBOM'].year).items()],
    ),
    axis=1,
)
# Business days per row when only the holidays named in my_set count as days off
predicted_df['business_days_rocket'] = predicted_df.apply(
    lambda x: np.busday_count(
        x['PredictionTargetDateBOM'].date(),
        x['DayAfterTargetDateEOM'].date(),
        holidays=[k for k, v in holidays.US(years=x['PredictionTargetDateBOM'].year).items() if v in my_set],
    ),
    axis=1,
)

cols = ['business_days_bankers', 'business_days_rocket']
predicted_df['business_days_final'] = predicted_df[cols].mean(axis = 1)
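
As a quick sanity check of what the averaging produces, the sketch below calls np.busday_count directly for June and November 2022. The two date lists are hard-coded from the 2022 holiday output in the question (an assumption made so that the example runs without the holidays package), and averaged_business_days is a hypothetical helper name.

```python
import numpy as np

# All 2022 US holiday dates from the question's output (banker calendar).
banker_dates = [
    "2022-01-01", "2022-01-17", "2022-02-21", "2022-05-30", "2022-06-19",
    "2022-06-20", "2022-07-04", "2022-09-05", "2022-10-10", "2022-11-11",
    "2022-11-24", "2022-12-25", "2022-12-26",
]
# Only the dates whose names are in the rocket list, observed variants included.
rocket_dates = [
    "2022-01-01", "2022-01-17", "2022-05-30", "2022-07-04",
    "2022-09-05", "2022-11-24", "2022-12-25", "2022-12-26",
]

def averaged_business_days(start, end):
    """Return (banker count, rocket count, mean) for the half-open range [start, end)."""
    bankers = np.busday_count(start, end, holidays=banker_dates)
    rocket = np.busday_count(start, end, holidays=rocket_dates)
    return int(bankers), int(rocket), (bankers + rocket) / 2

june = averaged_business_days("2022-06-01", "2022-07-01")
november = averaged_business_days("2022-11-01", "2022-12-01")
```

With these inputs June averages to 21.5 and November to 20.5, that is, half a day below the rocket count for each extra banker holiday that falls on a weekday. The question's desired output adds half a day instead (22.5 and 21.5), so it is worth checking which direction is actually intended before using this approach.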