I have a pandas DataFrame with a non-continuous date index (weekends and holidays are missing). I want to add a column containing the number of days until the next day off.
Here is code that generates an example DataFrame with the desired values in the till_day_off column:
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start="2022-06-06", periods=15))
df["day_of_week"] = df.index.dayofweek # day of the week as a number (Monday=0, Sunday=6)
df = df[(df.day_of_week < 5)] # remove weekends
df = df.drop(index="2022-06-15") # remove Wednesday in second week
df["till_day_off"] = [5,4,3,2,1,2,1,2,1,1] # desired values, end of column is treated as day off
Resulting dataframe:
| | day_of_week | till_day_off |
|---|---|---|
2022-06-06 | 0 | 5 |
2022-06-07 | 1 | 4 |
2022-06-08 | 2 | 3 |
2022-06-09 | 3 | 2 |
2022-06-10 | 4 | 1 |
2022-06-13 | 0 | 2 |
2022-06-14 | 1 | 1 |
2022-06-16 | 3 | 2 |
2022-06-17 | 4 | 1 |
2022-06-20 | 0 | 1 |
The real DataFrame has over 7,000 rows, so I want to avoid iterating over rows. Any idea how to tackle this?
CodePudding user response:
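Assuming a sorted input (if not, sort it by date first), you can build a mask that flags rows whose next row is the following calendar day, then use it to group consecutive days and compute a cumcount:
# True where the next row is exactly one day later;
# the Series is reversed to count until (not since) the day off
mask = (-df.index.to_series().diff(-1)).eq('1d').iloc[::-1]
# each False starts a new group; the assignment realigns on the index
df['till_day_off'] = mask.groupby((~mask).cumsum()).cumcount().add(1)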
output:
day_of_week till_day_off
2022-06-06 0 5
2022-06-07 1 4
2022-06-08 2 3
2022-06-09 3 2
2022-06-10 4 1
2022-06-13 0 2
2022-06-14 1 1
2022-06-16 3 2
2022-06-17 4 1
2022-06-20 0 1
intermediates:
mask
2022-06-20 False
2022-06-17 False
2022-06-16 True
2022-06-14 False
2022-06-13 True
2022-06-10 False
2022-06-09 True
2022-06-08 True
2022-06-07 True
2022-06-06 True
dtype: bool
(~mask).cumsum()
2022-06-20 1
2022-06-17 2
2022-06-16 2
2022-06-14 3
2022-06-13 3
2022-06-10 4
2022-06-09 4
2022-06-08 4
2022-06-07 4
2022-06-06 4
dtype: int64
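For reference, here is the same idea wrapped in a self-contained function and run against the example from the question. This is only a sketch: the function name and the intermediate variable names are my own, and it assumes a sorted index with unique dates.

```python
import pandas as pd


def till_day_off(index: pd.DatetimeIndex) -> pd.Series:
    """Count the rows from each date up to the next missing calendar day."""
    dates = index.to_series()
    # diff(-1) is "current minus next", so a consecutive day gives -1 day;
    # the last row yields NaT and therefore False (end of data = day off)
    next_is_consecutive = dates.diff(-1).eq(pd.Timedelta(days=-1))
    # Walk backwards so the count runs "until" the gap rather than "since" it
    rev = next_is_consecutive.iloc[::-1]
    # Each False is the last day before a day off and starts a new run
    run_id = (~rev).cumsum()
    return rev.groupby(run_id).cumcount().add(1).sort_index()


# Rebuild the example index: weekdays only, with Wednesday 2022-06-15 removed
idx = pd.date_range("2022-06-06", periods=15)
idx = idx[(idx.dayofweek < 5) & (idx != pd.Timestamp("2022-06-15"))]

result = till_day_off(idx)
print(result)
```

The returned Series is indexed by date, so it can be assigned straight to a column of a DataFrame that uses the same index.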
CodePudding user response:
If I understand your question correctly, you could use apply:
def calc_to_days_off(day):
    # do your calculation here
    raise NotImplementedError
df["till_day_off"] = df["day_of_week"].apply(calc_to_days_off)
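One possible way to fill in that placeholder is sketched below. Note the assumptions: day_of_week alone cannot reveal holidays, so this version applies over the index dates instead and looks them up in a set (working_days and calc_days_to_off are names I made up). It is still a row-by-row computation, so expect it to be slower than a vectorized approach on large frames.

```python
import pandas as pd

# Rebuild the example frame: weekdays only, with Wednesday 2022-06-15 removed
idx = pd.date_range("2022-06-06", periods=15)
idx = idx[(idx.dayofweek < 5) & (idx != pd.Timestamp("2022-06-15"))]
df = pd.DataFrame({"day_of_week": idx.dayofweek}, index=idx)

# Set of dates present in the index, for fast membership tests
working_days = set(df.index)


def calc_days_to_off(day: pd.Timestamp) -> int:
    """Step forward one calendar day at a time until a date is missing."""
    n = 1
    while day + pd.Timedelta(days=n) in working_days:
        n += 1
    return n


# Apply over the dates themselves, not over day_of_week
df["till_day_off"] = df.index.to_series().apply(calc_days_to_off)
print(df)
```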
CodePudding user response:
Create a DataFrame of the missing dates, then use an asof merge to match each row with the closest missing date in the future and calculate the time until that day off.
import pandas as pd
# DataFrame of missing dates, i.e. days off
# (the range must extend at least one day past the last row of df)
df1 = pd.DataFrame({'day_off': pd.date_range('2022-06-06', periods=200)})
df1 = df1[~df1['day_off'].isin(df.index)]
df = pd.merge_asof(df, df1, left_index=True, right_on='day_off', direction='forward')
df['till_day_off'] = (df['day_off'] - df.index).dt.days
print(df)
day_of_week day_off till_day_off
2022-06-06 0 2022-06-11 5
2022-06-07 1 2022-06-11 4
2022-06-08 2 2022-06-11 3
2022-06-09 3 2022-06-11 2
2022-06-10 4 2022-06-11 1
2022-06-13 0 2022-06-15 2
2022-06-14 1 2022-06-15 1
2022-06-16 3 2022-06-18 2
2022-06-17 4 2022-06-18 1
2022-06-20 0 2022-06-21 1
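With a longer frame the hard-coded start date and periods=200 would need adjusting. A sketch of the same asof-merge idea with the calendar range derived from the index follows; the names span_days, all_days, days_off and out are illustrative, and it assumes the index is sorted and that the day after the last row counts as a day off, as in the question.

```python
import pandas as pd

# Rebuild the example frame: weekdays only, with Wednesday 2022-06-15 removed
idx = pd.date_range("2022-06-06", periods=15)
idx = idx[(idx.dayofweek < 5) & (idx != pd.Timestamp("2022-06-15"))]
df = pd.DataFrame({"day_of_week": idx.dayofweek}, index=idx)

# Calendar days spanned by the index, first and last row included
span_days = (df.index.max() - df.index.min()).days + 1
# One extra period so the day after the last row is treated as a day off
all_days = pd.date_range(df.index.min(), periods=span_days + 1)

# Days off are the calendar days that are absent from the index
days_off = pd.DataFrame({"day_off": all_days.difference(df.index)})

# For each row, find the nearest day off that lies in the future
out = pd.merge_asof(
    df, days_off, left_index=True, right_on="day_off", direction="forward"
)
out["till_day_off"] = (out["day_off"] - out.index).dt.days
print(out)
```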