I have a dataframe with 2 columns similar to below.
------ ------------- ------------
| id | start_date | stop_date |
------ ------------- ------------
| Foo | 2019-06-01 | 2019-06-03 |
| Bar | 2019-06-07 | 2019-06-10 |
| Pop | 2019-06-09 | 2019-06-11 |
| Bob | 2019-06-13 | |
| Tom | 2019-06-01 | 2019-06-05 |
| Tim | 2019-06-04 | 2019-06-05 |
| Ben | 2019-06-07 | 2019-06-09 |
| Ted | 2019-06-08 | 2019-06-09 |
------ ------------ -------------
I need to return 2 df's, one with the count of active items within the date range (example below)
------------ -------
| Day |Active |
------------ -------
| 2019-06-01 | 2 |
| 2019-06-02 | 2 |
| 2019-06-03 | 2 |
| 2019-06-04 | 2 |
| 2019-06-05 | 2 |
| 2019-06-06 | 0 |
| 2019-06-07 | 2 |
| 2019-06-08 | 3 |
| 2019-06-09 | 4 |
| 2019-06-10 | 2 |
| 2019-06-11 | 1 |
| 2019-06-12 | 0 |
| 2019-06-13 | 1 |
| 2019-06-14 | 1 |
| 2019-06-15 | 1 |
------------ -------
and another that returns a df with that contain active items for a given date ie 2019-06-10 returns df:
| Bar | 2019-06-07 | 2019-06-10 |
| Pop | 2019-06-09 | 2019-06-11 |
So far I have tried to return the the second example:
active_date = pd.Timestamp('2019-06-10')
df_active = df[(df['start_date'] <= active_date) & ((df["stop_date"].isnull()) | (df["stop_date"] > active_date))]`
Any help is appreciated!
CodePudding user response:
You can do this:
df[["start_date", "stop_date"]] = df[["start_date", "stop_date"]].apply(pd.to_datetime)
df = df.ffill(axis=1)
df["days"] = [
pd.date_range(s, e, freq="D") for s, e in zip(df["start_date"], df["stop_date"])
]
df2 = (
df.explode("days")
.groupby("days")["id"]
.nunique()
.reindex(pd.date_range(df["start_date"].min(), df["stop_date"].max()), fill_value=0)
)
Output:
2019-06-01 2
2019-06-02 2
2019-06-03 2
2019-06-04 2
2019-06-05 2
2019-06-06 0
2019-06-07 2
2019-06-08 3
2019-06-09 4
2019-06-10 2
2019-06-11 1
2019-06-12 0
2019-06-13 1
Freq: D, Name: id, dtype: int64
And, use pd.IntervalIndex:
active_date = pd.Timestamp('2019-06-10')
df[
pd.IntervalIndex.from_arrays(df["start_date"], df["stop_date"]).contains(
active_date
)
].drop("days", axis=1)
Output:
id start_date stop_date
1 Bar 2019-06-07 2019-06-10
2 Pop 2019-06-09 2019-06-11