Pandas: Efficient ways to calculate occupancy rate between check-in and check-out dates

Time:05-24

I have the following toy data:

import pandas as pd

a = pd.DataFrame({"chkin": ["2022-05-22", "2022-05-22", "2022-05-23", "2022-05-24"],
                  "chkout": ["2022-05-25", "2022-05-23", "2022-05-26", "2022-05-27"],
                  "rtype": ["A", "A", "A", "A"],
                  "nbooked": [1, 2, 3, 1],
                  "nrooms": [10, 10, 10, 10]})

b = pd.DataFrame({"chkin": ["2022-05-22", "2022-05-23", "2022-05-23", "2022-05-24"],
                  "chkout": ["2022-05-24", "2022-05-26", "2022-05-24", "2022-05-25"],
                  "rtype": ["B", "B", "B", "B"],
                  "nbooked": [2, 1, 1, 3],
                  "nrooms": [12, 12, 12, 12]})

booking = pd.concat([a, b], axis=0, ignore_index=True, sort=False)
booking["chkin"] = pd.to_datetime(booking["chkin"])
booking["chkout"] = pd.to_datetime(booking["chkout"])

My problem is explained in the following figure

[Figure: calendar with one colored bar per booking; image not available]

nbooked is the number of rooms booked. Each colored bar on the calendar represents one row of the input data. I would like to calculate the occupancy rate for each day, starting from the earliest check-in date and ending at the last check-in date. (Assume that the occupancy rate for each room type is initially zero.)

Since a single day can have both check-ins and check-outs, as shown in the calendar, simply summing nbooked is not guaranteed to give the correct answer. Could you suggest an efficient way to calculate this?

Answer:

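You could create a date range of occupied nights for each booking and then explode it, which lets you group by room type and day and sum the booked rooms. The check-out day itself is not counted as occupied. Creating the date ranges and exploding them will be a bit slow if your DataFrame is long.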

This will also only give you dates in the output for which the occupancy is non-zero. If you also need the zeroes, reindex over the list of dates you care about.

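# The check-out day is not an occupied night, so each range ends one day earlier
booking['chkout_2'] = booking.chkout - pd.offsets.DateOffset(days=1)
# One DatetimeIndex of occupied nights per booking
booking['date'] = booking.apply(lambda r: pd.date_range(r.chkin, r.chkout_2, freq='D'), axis=1)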

res = (booking.set_index(['rtype', 'nbooked', 'nrooms'])
          .explode('date')
          .reset_index()
          .groupby(['rtype', 'date'])
          .agg({'nbooked': 'sum', 'nrooms': 'max'}))

res['occ'] = res['nbooked']/res['nrooms']

print(res)

                  nbooked  nrooms       occ
rtype date
A     2022-05-22        3      10  0.300000
      2022-05-23        4      10  0.400000
      2022-05-24        5      10  0.500000
      2022-05-25        4      10  0.400000
      2022-05-26        1      10  0.100000
B     2022-05-22        2      12  0.166667
      2022-05-23        4      12  0.333333
      2022-05-24        4      12  0.333333
      2022-05-25        1      12  0.083333
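
The reindex step mentioned above could look roughly like the sketch below. It is only one way to do it, and it rests on a few assumptions that are not part of the original answer: the dates you care about are taken to be every day from the first check-in to the last occupied night, nrooms is taken to be constant per room type, and the names all_dates, full_index and res_full are made up for illustration.

```python
import pandas as pd

# Toy data from the question, condensed into a single frame
booking = pd.DataFrame({
    "chkin": pd.to_datetime(["2022-05-22", "2022-05-22", "2022-05-23", "2022-05-24",
                             "2022-05-22", "2022-05-23", "2022-05-23", "2022-05-24"]),
    "chkout": pd.to_datetime(["2022-05-25", "2022-05-23", "2022-05-26", "2022-05-27",
                              "2022-05-24", "2022-05-26", "2022-05-24", "2022-05-25"]),
    "rtype": ["A"] * 4 + ["B"] * 4,
    "nbooked": [1, 2, 3, 1, 2, 1, 1, 3],
    "nrooms": [10] * 4 + [12] * 4,
})

# Same idea as the answer: one row per occupied night (check-out day excluded)
booking["date"] = booking.apply(
    lambda r: pd.date_range(r.chkin, r.chkout - pd.Timedelta(days=1), freq="D"), axis=1)
exploded = booking.explode("date")
exploded["date"] = pd.to_datetime(exploded["date"])  # explode leaves an object column
res = exploded.groupby(["rtype", "date"]).agg({"nbooked": "sum", "nrooms": "max"})

# Assumption: the dates of interest run from the first check-in to the last occupied night
all_dates = pd.date_range(booking["chkin"].min(),
                          booking["chkout"].max() - pd.Timedelta(days=1), freq="D")
full_index = pd.MultiIndex.from_product(
    [res.index.get_level_values("rtype").unique(), all_dates], names=["rtype", "date"])

# Reindex so that nights without any booking appear as rows, then fill them in
res_full = res.reindex(full_index)
res_full["nbooked"] = res_full["nbooked"].fillna(0).astype(int)
# Assumption: capacity is constant per room type, so reuse each type's known value
res_full["nrooms"] = res_full.groupby(level="rtype")["nrooms"].transform("max").astype(int)
res_full["occ"] = res_full["nbooked"] / res_full["nrooms"]

print(res_full)
```

With the toy data, the only row this adds is room type B on 2022-05-26, with nbooked 0 and occ 0.0.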

Another option, which may be more performant if you have a small set of dates that are relevant for each 'rtype', is to do a cross join to all dates and then filter down to the rows you care about. Output is identical to the above.

# Daily df of relevant dates (must run through the last occupied night, 2022-05-26)
df_dates = pd.DataFrame({'date': pd.date_range('2022-05-22', '2022-05-26', freq='D')})

res = (booking.merge(df_dates, how='cross')
              .query('date >= chkin & date < chkout')
              .groupby(['rtype', 'date'])
              .agg({'nbooked': 'sum', 'nrooms': 'max'}))

res['occ'] = res['nbooked']/res['nrooms']
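
If you would rather not hard-code the dates, the range can be derived from the bookings themselves. The following is a minimal self-contained sketch of the cross-join approach under that assumption. Taking the range from the earliest chkin to the day before the latest chkout is my own choice here, not something the original answer specifies, and how='cross' needs pandas 1.2 or newer.

```python
import pandas as pd

# Toy data from the question, condensed into a single frame
booking = pd.DataFrame({
    "chkin": pd.to_datetime(["2022-05-22", "2022-05-22", "2022-05-23", "2022-05-24",
                             "2022-05-22", "2022-05-23", "2022-05-23", "2022-05-24"]),
    "chkout": pd.to_datetime(["2022-05-25", "2022-05-23", "2022-05-26", "2022-05-27",
                              "2022-05-24", "2022-05-26", "2022-05-24", "2022-05-25"]),
    "rtype": ["A"] * 4 + ["B"] * 4,
    "nbooked": [1, 2, 3, 1, 2, 1, 1, 3],
    "nrooms": [10] * 4 + [12] * 4,
})

# Assumption: relevant dates span the first check-in through the last occupied night
df_dates = pd.DataFrame({"date": pd.date_range(
    booking["chkin"].min(),
    booking["chkout"].max() - pd.Timedelta(days=1),
    freq="D")})

# Pair every booking with every date, then keep only the nights the booking covers
res = (booking.merge(df_dates, how="cross")
              .query("date >= chkin and date < chkout")
              .groupby(["rtype", "date"])
              .agg({"nbooked": "sum", "nrooms": "max"}))
res["occ"] = res["nbooked"] / res["nrooms"]

print(res["nbooked"].tolist())  # [3, 4, 5, 4, 1, 2, 4, 4, 1]
```

The printed list matches the nbooked column of the table from the explode approach, row for row. Keep in mind that the cross join creates len(booking) * len(df_dates) rows before filtering, so it only pays off when the date range is short.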