Calculate cumulative occupancy from a variable by date ranges (summation)

Consider the following pandas DataFrame, where each row represents one person's stay in a hotel.

| entry_date | exit_date  | days   | other_columns |
| ---------- | ---------- | ------ | ------------- |
| 2022-02-01 | 2022-02-05 | 5      |  ...          |
| 2022-02-02 | 2022-02-03 | 2      |  ...          |
| 2022-04-10 | 2022-04-13 | 4      |  ...          |
| 2022-04-11 | 2022-04-12 | 2      |  ...          |
| 2022-04-12 | 2022-04-13 | 2      |  ...          |
| 2022-11-10 | 2022-11-15 | 6      |  ...          |

I want to build a new DataFrame from it that gives, for each day, the number of people staying in the hotel on that day. I am not taking nights into account, only the days variable: a stay with days = 5 that starts on 2022-02-01 counts for every date from 2022-02-01 to 2022-02-05.

| date       | ocupation  |
| ---------- | ---------- |
| 2022-02-01 |     1      |
| 2022-02-02 |     2      |
| 2022-02-03 |     2      |
| 2022-02-04 |     1      |
| 2022-02-05 |     1      |
| 2022-04-10 |     1      |
| 2022-04-11 |     2      |
| 2022-04-12 |     3      |
| 2022-04-13 |     2      |
| 2022-11-10 |     1      |
| 2022-11-11 |     1      |
| 2022-11-12 |     1      |
| 2022-11-13 |     1      |
| 2022-11-14 |     1      |
| 2022-11-15 |     1      |
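
To reproduce the example, the input table can be built as below. This is a minimal sketch: the dates are kept as strings, as they might arrive from a CSV file, and other_columns is left out.

```python
import pandas as pd

# Sample stays copied from the table above; other_columns is omitted.
df = pd.DataFrame({
    "entry_date": ["2022-02-01", "2022-02-02", "2022-04-10",
                   "2022-04-11", "2022-04-12", "2022-11-10"],
    "exit_date": ["2022-02-05", "2022-02-03", "2022-04-13",
                  "2022-04-12", "2022-04-13", "2022-11-15"],
    "days": [5, 2, 4, 2, 2, 6],
})

# The total number of person-days should equal the sum of the
# occupancy column in the expected result.
print(df["days"].sum())
```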

CodePudding user response：

Use:

import pandas as pd

# convert the column to datetimes
df['entry_date'] = pd.to_datetime(df['entry_date'])

# repeat each row once per day of the stay (index labels are duplicated)
df = df.loc[df.index.repeat(df['days'])]

# day offsets 0, 1, 2, ... within each original row, as timedeltas
td = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='D')

# add the offsets to the entry dates, then count the rows per date
df1 = (df['entry_date'].add(td)
                       .value_counts()
                       .sort_index()
                       .rename_axis('date')
                       .reset_index(name='ocupation'))
print(df1)

         date  ocupation
0  2022-02-01         1
1  2022-02-02         2
2  2022-02-03         2
3  2022-02-04         1
4  2022-02-05         1
5  2022-04-10         1
6  2022-04-11         2
7  2022-04-12         3
8  2022-04-13         2
9  2022-11-10         1
10 2022-11-11         1
11 2022-11-12         1
12 2022-11-13         1
13 2022-11-14         1
14 2022-11-15         1
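
To see what the repeat and cumcount steps produce, here is a small walk-through. It is only a sketch on a hypothetical two-row frame (the names small, rep, offset and occupied are illustrative, not part of the answer above):

```python
import pandas as pd

# Hypothetical two-stay frame, smaller than the question's data.
small = pd.DataFrame({
    "entry_date": pd.to_datetime(["2022-02-01", "2022-02-02"]),
    "days": [3, 2],
})

# Each row is repeated `days` times; the original index labels are kept,
# so the index becomes 0, 0, 0, 1, 1.
rep = small.loc[small.index.repeat(small["days"])]

# cumcount numbers the copies of each original row: 0, 1, 2, 0, 1.
offset = rep.groupby(level=0).cumcount()

# entry_date + offset days gives every date covered by every stay.
occupied = rep["entry_date"] + pd.to_timedelta(offset, unit="D")
print(occupied.dt.strftime("%Y-%m-%d").tolist())
# ['2022-02-01', '2022-02-02', '2022-02-03', '2022-02-02', '2022-02-03']
```

Counting how often each date appears in occupied (with value_counts) then gives the occupancy per day.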

Performance: Sample data repeated 1000 times:

df = pd.concat([df] * 1000, ignore_index=True)

def jez(df):
    # convert the column to datetimes (the sample data is year-month-day);
    # note that this assigns into the DataFrame passed by the caller
    df['entry_date'] = pd.to_datetime(df['entry_date'])

    # repeat each row once per day of the stay
    df = df.loc[df.index.repeat(df['days'])]

    # day offsets within each original row, as timedeltas
    td = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='D')

    # add the offsets to the entry dates, then count the rows per date
    return (df['entry_date'].add(td)
                           .value_counts()
                           .sort_index()
                           .rename_axis('date')
                           .reset_index(name='ocupation'))


def moz(df):
    return (pd.Series([d for start, end in zip(df['entry_date'], df['exit_date'])
                       for d in pd.date_range(start, end, freq='D')], name='date')
              .value_counts(sort=False)
              .reset_index(name='ocupation'))

In [122]: %timeit jez(df)
15.3 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [123]: %timeit moz(df)
2.31 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
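
Outside IPython, %timeit is not available. A comparable measurement can be sketched with time.perf_counter; best_ms is a hypothetical helper name, and the lambda below only stands in for jez or moz. Each run gets a fresh copy of the data because jez assigns to df['entry_date'], so later runs would otherwise start from an already converted column.

```python
import time

import pandas as pd


def best_ms(func, frame, repeat=5):
    """Return the best wall-clock time of func(frame) in milliseconds."""
    timings = []
    for _ in range(repeat):
        fresh = frame.copy()            # untimed copy, protects `frame`
        start = time.perf_counter()
        func(fresh)
        timings.append(time.perf_counter() - start)
    return min(timings) * 1000


# Stand-in data and function; pass jez or moz instead of the lambda.
sample = pd.DataFrame({"days": [5, 2, 4, 2, 2, 6]})
elapsed = best_ms(lambda f: f["days"].sum(), sample)
print(f"{elapsed:.3f} ms")
```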

CodePudding user response：

You can use date_range and value_counts:

# ensure datetime; run only ONE of the two conversions below
# for year-month-day strings (the format of the sample data)
df[['entry_date', 'exit_date']] = df[['entry_date', 'exit_date']].apply(pd.to_datetime, dayfirst=False)
# for year-day-month strings, use this instead
# df[['entry_date', 'exit_date']] = df[['entry_date', 'exit_date']].apply(pd.to_datetime, dayfirst=True)


(pd.Series([d for start, end in zip(df['entry_date'], df['exit_date'])
            for d in pd.date_range(start, end, freq='D')], name='date')
   .value_counts(sort=False)
   .reset_index(name='ocupation')
)

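Output (the first column is labelled index in the pandas version used here; pandas 2.0 and later may label it date):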

        index  ocupation
0  2022-02-01          1
1  2022-02-02          2
2  2022-02-03          2
3  2022-02-04          1
4  2022-02-05          1
5  2022-04-10          1
6  2022-04-11          2
7  2022-04-12          3
8  2022-04-13          2
9  2022-11-10          1
10 2022-11-11          1
11 2022-11-12          1
12 2022-11-13          1
13 2022-11-14          1
14 2022-11-15          1
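
The first answer counts from the days column, while this one expands entry_date to exit_date inclusive, so the two agree only if days equals the inclusive length of each range. A quick check, sketched on the question's sample rows (the names stays, span and mismatch are illustrative):

```python
import pandas as pd

# Sample stays from the question, already parsed as datetimes.
stays = pd.DataFrame({
    "entry_date": pd.to_datetime(["2022-02-01", "2022-02-02", "2022-04-10",
                                  "2022-04-11", "2022-04-12", "2022-11-10"]),
    "exit_date": pd.to_datetime(["2022-02-05", "2022-02-03", "2022-04-13",
                                 "2022-04-12", "2022-04-13", "2022-11-15"]),
    "days": [5, 2, 4, 2, 2, 6],
})

# Inclusive length of each stay: both end dates count as occupied days.
span = (stays["exit_date"] - stays["entry_date"]).dt.days + 1

# Rows where the days column disagrees with the date range.
mismatch = stays[span != stays["days"]]
print(mismatch.empty)
```

If mismatch is not empty on real data, the two approaches will give different counts for those stays.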