Consider the following pandas DataFrame, where each row represents one person's stay in a hotel.
| entry_date | exit_date | days | other_columns |
| ---------- | ---------- | ------ | ------------- |
| 2022-02-01 | 2022-02-05 | 5 | ... |
| 2022-02-02 | 2022-02-03 | 2 | ... |
| 2022-04-10 | 2022-04-13 | 4 | ... |
| 2022-04-11 | 2022-04-12 | 2 | ... |
| 2022-04-12 | 2022-04-13 | 2 | ... |
| 2022-11-10 | 2022-11-15 | 6 | ... |
From this DataFrame I want to build a new one that gives, for each day, the occupancy of the hotel, i.e. the number of stays covering that day. I am not counting nights, only the days column, so both the entry date and the exit date count as occupied days.
| date | ocupation |
| ---------- | ---------- |
| 2022-02-01 | 1 |
| 2022-02-02 | 2 |
| 2022-02-03 | 2 |
| 2022-02-04 | 1 |
| 2022-02-05 | 1 |
| 2022-04-10 | 1 |
| 2022-04-11 | 2 |
| 2022-04-12 | 3 |
| 2022-04-13 | 2 |
| 2022-11-10 | 1 |
| 2022-11-11 | 1 |
| 2022-11-12 | 1 |
| 2022-11-13 | 1 |
| 2022-11-14 | 1 |
| 2022-11-15 | 1 |
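For reproducibility, the sample input above can be built roughly as follows. This is only a sketch: it assumes the dates are stored as year-month-day strings, and it leaves out other_columns because their contents are not shown.

```python
import pandas as pd

# Sample stays from the first table; other_columns is omitted (contents not shown).
df = pd.DataFrame({
    'entry_date': ['2022-02-01', '2022-02-02', '2022-04-10',
                   '2022-04-11', '2022-04-12', '2022-11-10'],
    'exit_date': ['2022-02-05', '2022-02-03', '2022-04-13',
                  '2022-04-12', '2022-04-13', '2022-11-15'],
    'days': [5, 2, 4, 2, 2, 6],
})

# Sanity check: days counts both the entry date and the exit date.
span = (pd.to_datetime(df['exit_date']) - pd.to_datetime(df['entry_date'])).dt.days + 1
print((span == df['days']).all())
```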
CodePudding user response:
Use:
import pandas as pd

# convert column to datetimes
df['entry_date'] = pd.to_datetime(df['entry_date'])
# repeat rows by days column
df = df.loc[df.index.repeat(df['days'])]
# create days timedeltas
td = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
# add the timedeltas to the entry dates and count each date into a 2-column DataFrame
df1 = (df['entry_date'].add(td)
         .value_counts()
         .sort_index()
         .rename_axis('date')
         .reset_index(name='ocupation'))
print(df1)
date ocupation
0 2022-02-01 1
1 2022-02-02 2
2 2022-02-03 2
3 2022-02-04 1
4 2022-02-05 1
5 2022-04-10 1
6 2022-04-11 2
7 2022-04-12 3
8 2022-04-13 2
9 2022-11-10 1
10 2022-11-11 1
11 2022-11-12 1
12 2022-11-13 1
13 2022-11-14 1
14 2022-11-15 1
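To see why this works, it can help to look at the intermediate values on a tiny frame. The following is only an illustrative sketch; the two-row frame `small` is made up for the example and is not part of the question's data.

```python
import pandas as pd

# Hypothetical two-stay frame, used only to show the intermediate steps.
small = pd.DataFrame({'entry_date': pd.to_datetime(['2022-02-01', '2022-02-02']),
                      'days': [3, 2]})

# Each row is repeated `days` times; the original index labels are kept.
rep = small.loc[small.index.repeat(small['days'])]
print(rep.index.tolist())    # [0, 0, 0, 1, 1]

# cumcount numbers the copies of each original row 0, 1, 2, ...
offsets = rep.groupby(level=0).cumcount()
print(offsets.tolist())      # [0, 1, 2, 0, 1]

# Adding the offsets as day timedeltas gives one date per occupied day.
dates = rep['entry_date'] + pd.to_timedelta(offsets, unit='d')
print(dates.dt.strftime('%Y-%m-%d').tolist())
```

Counting how often each date appears in `dates` is then exactly the occupancy per day, which is what value_counts does in the answer.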
Performance: Sample data repeated 1000 times:
df = pd.concat([df] * 1000, ignore_index=True)
def jez(df):
    # convert column to datetimes
    df['entry_date'] = pd.to_datetime(df['entry_date'], dayfirst=True)
    # repeat rows by days column
    df = df.loc[df.index.repeat(df['days'])]
    # create days timedeltas
    td = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
    # add the timedeltas to the entry dates and count each date into a 2-column DataFrame
    return (df['entry_date'].add(td)
              .value_counts()
              .sort_index()
              .rename_axis('date')
              .reset_index(name='ocupation'))

def moz(df):
    return (pd.Series([d for start, end in zip(df['entry_date'], df['exit_date'])
                       for d in pd.date_range(start, end, freq='D')], name='date')
              .value_counts(sort=False)
              .reset_index(name='ocupation')
           )
In [122]: %timeit jez(df)
15.3 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [123]: %timeit moz(df)
2.31 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
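Outside IPython, a comparison along these lines can be sketched with the standard timeit module. This is not the benchmark that produced the numbers above, just a sketch under stated assumptions: the sample frame is rebuilt inline, it is repeated only 100 times to keep the run short, both functions (renamed `by_repeat` and `by_date_range` here) receive already-converted datetimes, and `sort_index`/`rename_axis` are added to the second one so the results can be compared directly. Timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Sample stays from the question, repeated 100 times (the answer above used 1000).
base = pd.DataFrame({
    'entry_date': ['2022-02-01', '2022-02-02', '2022-04-10',
                   '2022-04-11', '2022-04-12', '2022-11-10'],
    'exit_date': ['2022-02-05', '2022-02-03', '2022-04-13',
                  '2022-04-12', '2022-04-13', '2022-11-15'],
    'days': [5, 2, 4, 2, 2, 6],
})
base[['entry_date', 'exit_date']] = base[['entry_date', 'exit_date']].apply(pd.to_datetime)
big = pd.concat([base] * 100, ignore_index=True)

def by_repeat(df):
    # Repeat each row `days` times and offset the entry date by 0..days-1.
    rep = df.loc[df.index.repeat(df['days'])]
    td = pd.to_timedelta(rep.groupby(level=0).cumcount(), unit='d')
    return (rep['entry_date'].add(td)
               .value_counts()
               .sort_index()
               .rename_axis('date')
               .reset_index(name='ocupation'))

def by_date_range(df):
    # Build one date_range per stay and count every generated date.
    dates = pd.Series([d for start, end in zip(df['entry_date'], df['exit_date'])
                       for d in pd.date_range(start, end, freq='D')])
    return (dates.value_counts()
                 .sort_index()
                 .rename_axis('date')
                 .reset_index(name='ocupation'))

# Both approaches should agree before their speed is compared.
a, b = by_repeat(big), by_date_range(big)
same = (a['date'].tolist() == b['date'].tolist()
        and a['ocupation'].tolist() == b['ocupation'].tolist())
print('results match:', same)

for fn in (by_repeat, by_date_range):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```

The gap reported above is plausible because the first approach stays vectorized, while the second calls pd.date_range once per row in a Python loop.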
CodePudding user response:
You can use date_range and value_counts:
# ensure both columns are datetimes; use whichever conversion matches your date format
# for year-day-month
df[['entry_date', 'exit_date']] = df[['entry_date', 'exit_date']].apply(pd.to_datetime, dayfirst=True)
# for year-month-day (the format of the sample data)
df[['entry_date', 'exit_date']] = df[['entry_date', 'exit_date']].apply(pd.to_datetime, dayfirst=False)
# build one date range per stay, then count how often each date occurs
(pd.Series([d for start, end in zip(df['entry_date'], df['exit_date'])
            for d in pd.date_range(start, end, freq='D')], name='date')
   .value_counts(sort=False)
   .reset_index(name='ocupation')
)
Output:
index ocupation
0 2022-02-01 1
1 2022-02-02 2
2 2022-02-03 2
3 2022-02-04 1
4 2022-02-05 1
5 2022-04-10 1
6 2022-04-11 2
7 2022-04-12 3
8 2022-04-13 2
9 2022-11-10 1
10 2022-11-11 1
11 2022-11-12 1
12 2022-11-13 1
13 2022-11-14 1
14 2022-11-15 1
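Note that the first column is labelled index in the output above. As far as I know this depends on the pandas version: since pandas 2.0, value_counts names the resulting index after the Series, so the column should come out as date there. A minimal sketch that should give the columns date and ocupation either way, assuming year-month-day input (the sample frame is rebuilt here only to keep the snippet self-contained, and the `sort_index` call is an addition to guarantee chronological order):

```python
import pandas as pd

# Sample stays from the question, converted to datetimes.
df = pd.DataFrame({
    'entry_date': ['2022-02-01', '2022-02-02', '2022-04-10',
                   '2022-04-11', '2022-04-12', '2022-11-10'],
    'exit_date': ['2022-02-05', '2022-02-03', '2022-04-13',
                  '2022-04-12', '2022-04-13', '2022-11-15'],
})
df[['entry_date', 'exit_date']] = df[['entry_date', 'exit_date']].apply(pd.to_datetime)

out = (pd.Series([d for start, end in zip(df['entry_date'], df['exit_date'])
                  for d in pd.date_range(start, end, freq='D')], name='date')
         .value_counts(sort=False)
         .sort_index()            # chronological order, independent of counting order
         .rename_axis('date')     # name the index explicitly, whatever the pandas version
         .reset_index(name='ocupation'))
print(out)
```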