i would like some help with optimizing a piece of code. First, I will provide some context. I have a DataFrame with 23k rows and 33 columns which represents reservations and another dataframe with 2k rows and 50 columns which represent properties. What I would like happenning is to understand monthly GBV (Gross Booking Value) and monthly booked nights. The tricky part is that "monthly" inclines the following:
If a reservation is from 25th of a month until the 5th of the next month and the ADR (Average Daily Rate) is 50 dollars, the result of this reservation is going to be the following:
Nights Booked in 1st month: 7 nights (from 25th to 31st) Nights Booked in 2nd month: 5 nights (from 1st to 5th) GBV for 1st month: 7 nights * 50 dollars = 350 dollars GBV for 2nd month: 5 nights * 50 dollars = 250 dollars
Sample data would be: In reservation dataframe: Arrival_date, Departure_date, Accommodation_amount In properties dataframe: GBV column for each month and Nights booked column for each month
Here is the piece of code that I am using currently:
i=0
while i < len(kpi):
if kpi['cod_reservation_status'][i] == 'CONF':
dateRange = pd.date_range(kpi['arrival'][i], kpi['departure'][i], freq='d', closed='left')
index = prop_index(int(kpi['id_room'][i]), props.id_room_type)
nGBV = kpi['accommodation_total_amount'][i] / len(dateRange)
for adate in dateRange:
if adate.month == 1 and adate.year == 2022:
props['January Booked'][index] = props['January Booked'][index] 1
props['January GBV'][index] = props['January GBV'][index] nGBV
if adate.month == 2 and adate.year == 2022:
props['February Booked'][index] = props['February Booked'][index] 1
props['February GBV'][index] = props['February GBV'][index] nGBV
if adate.month == 3 and adate.year == 2022:
props['March Booked'][index] = props['March Booked'][index] 1
props['March GBV'][index] = props['March GBV'][index] nGBV
if adate.month == 4 and adate.year == 2022:
props['April Booked'][index] = props['April Booked'][index] 1
props['April GBV'][index] = props['April GBV'][index] nGBV
if adate.month == 5 and adate.year == 2022:
props['May Booked'][index] = props['May Booked'][index] 1
props['May GBV'][index] = props['May GBV'][index] nGBV
if adate.month == 6 and adate.year == 2022:
props['June Booked'][index] = props['June Booked'][index] 1
props['June GBV'][index] = props['June GBV'][index] nGBV
if adate.month == 7 and adate.year == 2022:
props['July Booked'][index] = props['July Booked'][index] 1
props['July GBV'][index] = props['July GBV'][index] nGBV
if adate.month == 8 and adate.year == 2022:
props['August Booked'][index] = props['August Booked'][index] 1
props['August GBV'][index] = props['August GBV'][index] nGBV
if adate.month == 9 and adate.year == 2022:
props['September Booked'][index] = props['September Booked'][index] 1
props['September GBV'][index] = props['September GBV'][index] nGBV
if adate.month == 10 and adate.year == 2022:
props['October Booked'][index] = props['October Booked'][index] 1
props['October GBV'][index] = props['October GBV'][index] nGBV
if adate.month == 11 and adate.year == 2022:
props['November Booked'][index] = props['November Booked'][index] 1
props['November GBV'][index] = props['November GBV'][index] nGBV
if adate.month == 12 and adate.year == 2022:
props['December Booked'][index] = props['December Booked'][index] 1
props['December GBV'][index] = props['December GBV'][index] nGBV
i =1
That works and does the job BUT it is quite slow. On 22k entries, it takes me 2 and a half minutes. I am sure there is a way to optimize this. Any ideas are welcome. What are the best practices in optimizing loops and how would you go about this specific task?
PS: kpi = reservations DataFrame, props = properties DataFrame
CodePudding user response:
With pandas you must not use while and for loop where it is possible to use pandas built-in function which are written in C code and compiled (quicker than interpreted).
Try to uses props.loc[:,"column"].apply("the function you want to apply")
With pandas you must think the entire dataset and not iterate objects.
Look into the documnetation for df.loc[row,col] and df.apply()
CodePudding user response:
I noticed that you have this:
i=0
while i < len(kpi):
A while
loop is slower than a for
loop. Because a while loop checks the condition after each iteration.
So you should replace it with this:
for i in range(len(kpi))