Python - Alternatives to df.iterrows() with better performance

Time:03-11

I have the following dataframe:

Level2       Level3  DateStart   DateEnd
Monthly      1       2022-01-01  2022-01-01
Monthly      2       2022-01-01  2022-01-01
Monthly      5       2022-01-01  2022-01-01
Semi-annual  H1      2022-01-01  2022-01-01
Semi-annual  H2      2022-01-01  2022-01-01
Quarterly    Q1      2022-01-01  2022-01-01
Quarterly    Q3      2022-01-01  2022-01-01
Quarterly    Q4      2022-01-01  2022-01-01

Initially, all the 'DateStart' and 'DateEnd' datetimes are set to 2022-01-01 by default, and I need to adjust them according to the 'Level2' and 'Level3' columns. I can do this successfully with df.iterrows(), but the script takes ages to run because there are hundreds of thousands of rows. This is my code:

from dateutil.relativedelta import relativedelta

for i, row in df.iterrows():
    if df.loc[i, 'Level2'] == 'Monthly':
        # Level3 holds the month number, e.g. '5' for May
        df.loc[i, 'DateStart'] = df.loc[i, 'DateStart'] + relativedelta(months=int(df['Level3'][i]) - 1)
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(months=1, days=-1)
    elif df.loc[i, 'Level2'] == 'Quarterly':
        # Level3 looks like 'Q3'; the last character is the quarter number
        df.loc[i, 'DateStart'] = df.loc[i, 'DateStart'] + relativedelta(months=(int(df['Level3'][i][-1]) * 3) - 3)
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(months=3, days=-1)
    elif df.loc[i, 'Level2'] == 'Semi-annual':
        # Level3 looks like 'H2'; the last character is the half-year number
        df.loc[i, 'DateStart'] = df.loc[i, 'DateStart'] + relativedelta(months=(int(df['Level3'][i][-1]) * 6) - 6)
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(months=6, days=-1)
    else:
        # Any other level is treated as a full year
        df.loc[i, 'DateEnd'] = df.loc[i, 'DateStart'] + relativedelta(years=1, days=-1)

This is what we need the outcome to be in this case:

Level2       Level3  DateStart   DateEnd
Monthly      1       2022-01-01  2022-01-31
Monthly      2       2022-02-01  2022-02-28
Monthly      5       2022-05-01  2022-05-31
Semi-annual  H1      2022-01-01  2022-06-30
Semi-annual  H2      2022-07-01  2022-12-31
Quarterly    Q1      2022-01-01  2022-03-31
Quarterly    Q3      2022-07-01  2022-09-30
Quarterly    Q4      2022-10-01  2022-12-31

Any help making this process faster would be greatly appreciated.

CodePudding user response:

A few observations:

  • The “Level2” column is redundant, since the values in “Level3” distinguish between the different period lengths.

  • There are only 12 + 4 + 2 = 18 possible values for DateStart (12 months, 4 quarters, 2 half-years), and likewise for DateEnd.

Therefore, it would be simplest to precalculate all 18 possible values for DateStart and DateEnd, and store them in two dicts keyed by the “Level3” value.

Then use:

df["DateStart"] = df["Level3"].map(start_dict)
df["DateEnd"] = df["Level3"].map(end_dict)
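
One possible way to build the two dictionaries is sketched below. The helper name build_period_dicts, the fixed year 2022, and the assumption that “Level3” is stored as strings ('1' to '12', 'Q1' to 'Q4', 'H1', 'H2') are illustrative choices, not something stated in the question beyond its sample data.

```python
import pandas as pd

def build_period_dicts(year=2022):
    """Precompute start and end dates for all 18 possible Level3 values.

    Hypothetical helper: the name and the default year are assumptions.
    """
    start_dict, end_dict = {}, {}
    # (key prefix, number of periods per year, months per period)
    for prefix, count, months in [("", 12, 1), ("Q", 4, 3), ("H", 2, 6)]:
        for n in range(1, count + 1):
            key = f"{prefix}{n}"
            start = pd.Timestamp(year=year, month=(n - 1) * months + 1, day=1)
            # Last day of the period: jump to the next period's start, go back one day.
            end = start + pd.DateOffset(months=months) - pd.Timedelta(days=1)
            start_dict[key] = start
            end_dict[key] = end
    return start_dict, end_dict

start_dict, end_dict = build_period_dicts(2022)

# Small sample frame in the shape of the question's data.
df = pd.DataFrame({
    "Level2": ["Monthly", "Monthly", "Semi-annual", "Quarterly"],
    "Level3": ["2", "5", "H2", "Q3"],
})

# Cast to str so that integer month numbers would also match the dict keys.
df["DateStart"] = df["Level3"].astype(str).map(start_dict)
df["DateEnd"] = df["Level3"].astype(str).map(end_dict)
print(df)
```

Any “Level3” value that is not a key in the dictionaries comes back from map as NaT, so rows that fall into the question's final else branch (a full year) would still need separate handling.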

CodePudding user response:

You can use pandas' groupby and then aggregate each group, taking the earliest DateStart and the latest DateEnd:

import pandas as pd

data = {
    'Level2': ['Monthly', 'Monthly', 'Monthly', 'Monthly', 'Monthly', 'Monthly', 'Semi-annual', 'Semi-annual', 'Semi-annual'],
    'Level3': ['1', '1', '1', '2', '2', '2', 'H1', 'H1', 'H1'],
    'DateStart': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01'],
    'DateEnd': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-05-02', '2022-06-30']
}

df = pd.DataFrame(data)
# One group per (Level2, Level3) combination
df_grouped = df.groupby(['Level2', 'Level3'])
# The dates are ISO-formatted strings, so string min/max matches chronological order
df_res = df_grouped.agg({'DateStart': 'min', 'DateEnd': 'max'})