I have the following dataframe:
Level2 | Level 3 | DateStart | DateEnd |
---|---|---|---|
Monthly | 1 | 2022-01-01 | 2022-01-01 |
Monthly | 2 | 2022-01-01 | 2022-01-01 |
Monthly | 5 | 2022-01-01 | 2022-01-01 |
Semi-annual | H1 | 2022-01-01 | 2022-01-01 |
Semi-annual | H2 | 2022-01-01 | 2022-01-01 |
Quarterly | Q1 | 2022-01-01 | 2022-01-01 |
Quarterly | Q3 | 2022-01-01 | 2022-01-01 |
Quarterly | Q4 | 2022-01-01 | 2022-01-01 |
Initially, every 'DateStart' and 'DateEnd' value is set to 2022-01-01 by default, and I need to adjust them according to the Level2 and Level3 columns. I can do this successfully with df.iterrows(), but the script takes ages to run because there are hundreds of thousands of rows. This is my code:
from dateutil.relativedelta import relativedelta

for i, row in df.iterrows():
    if df.loc[i, 'Level2'] == 'Monthly':
        df.loc[i, 'DateStart'] = df.loc[i, 'DateStart'] + relativedelta(months=int(df['Level3'][i]) - 1)
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(months=1, days=-1)
    elif df.loc[i, 'Level2'] == 'Quarterly':
        df.loc[i, 'DateStart'] = df.loc[i, 'DateStart'] + relativedelta(months=(int(df['Level3'][i][-1]) * 3) - 3)
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(months=3, days=-1)
    elif df.loc[i, 'Level2'] == 'Semi-annual':
        df.loc[i, 'DateStart'] = df.loc[i, 'DateStart'] + relativedelta(months=(int(df['Level3'][i][-1]) * 6) - 6)
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(months=6, days=-1)
    else:
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(years=1, days=-1)
This is what we need the outcome to be in this case:
Level2 | Level 3 | DateStart | DateEnd |
---|---|---|---|
Monthly | 1 | 2022-01-01 | 2022-01-31 |
Monthly | 2 | 2022-02-01 | 2022-02-28 |
Monthly | 5 | 2022-05-01 | 2022-05-31 |
Semi-annual | H1 | 2022-01-01 | 2022-06-30 |
Semi-annual | H2 | 2022-07-01 | 2022-12-31 |
Quarterly | Q1 | 2022-01-01 | 2022-03-31 |
Quarterly | Q3 | 2022-07-01 | 2022-09-30 |
Quarterly | Q4 | 2022-10-01 | 2022-12-31 |
Any help making this process faster would be greatly appreciated.
CodePudding user response:
A few observations:
The “Level2” column is redundant, since the values in “Level3” distinguish between the different period lengths.
There are only 12 + 4 + 2 = 18 possible values for DateStart, and likewise for DateEnd.
Therefore, it would be simplest to precalculate all 18 possible values for DateStart and DateEnd, and store them in two dicts keyed by the Level3 value.
Then use:
df["DateStart"] = df["Level3"].map(start_dict)
df["DateEnd"] = df["Level3"].map(end_dict)
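The two dicts can be built in a few lines. The following is only a minimal sketch of that precalculation. It assumes that all periods fall in 2022 (the default date in the question) and that Level3 holds the labels "1" to "12", "Q1" to "Q4" and "H1" to "H2"; the helper name build_period_dicts is hypothetical.

```python
import pandas as pd


def build_period_dicts(year=2022):
    """Return (start_dict, end_dict) keyed by Level3 label (hypothetical helper)."""
    start_dict, end_dict = {}, {}
    # (label prefix, periods per year, months per period)
    for prefix, count, months in (("", 12, 1), ("Q", 4, 3), ("H", 2, 6)):
        for n in range(1, count + 1):
            label = f"{prefix}{n}"
            start = pd.Timestamp(year=year, month=(n - 1) * months + 1, day=1)
            # Last day of the period: first day of the next period minus one day.
            end = start + pd.DateOffset(months=months) - pd.Timedelta(days=1)
            start_dict[label] = start
            end_dict[label] = end
    return start_dict, end_dict


start_dict, end_dict = build_period_dicts(2022)

# Small illustrative frame; the real one has hundreds of thousands of rows.
df = pd.DataFrame({
    "Level2": ["Monthly", "Monthly", "Semi-annual", "Quarterly"],
    "Level3": ["1", "2", "H2", "Q3"],
})

# astype(str) guards against monthly labels stored as integers.
keys = df["Level3"].astype(str)
df["DateStart"] = keys.map(start_dict)
df["DateEnd"] = keys.map(end_dict)
print(df)
```

Rows whose Level3 value is not one of the 18 keys (for example, the yearly case handled by the else branch in the question) come back as missing values from map, so they would need to be filled in separately.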
CodePudding user response:
You can use the pandas groupby function and then aggregate each group.
import pandas as pd

data = {
    'Level2': ['Monthly', 'Monthly', 'Monthly', 'Monthly', 'Monthly', 'Monthly', 'Semi-annual', 'Semi-annual', 'Semi-annual'],
    'Level 3': ['1', '1', '1', '2', '2', '2', 'H1', 'H1', 'H1'],
    'DateStart': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01'],
    'DateEnd': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-05-02', '2022-06-30']
}
df = pd.DataFrame(data)
# Group by period type and label, then take the earliest start and latest end per group.
# The dates are ISO-formatted strings, so string min/max matches chronological order.
df_grouped = df.groupby(['Level2', 'Level 3'])
df_res = df_grouped.agg({'DateStart': 'min', 'DateEnd': 'max'})