Home > OS >  Count of active items on day given start and stop date
Count of active items on day given start and stop date

Time:03-05

I have a dataframe with 2 columns similar to below.

 ------ ------------- ------------ 
| id    | start_date |  stop_date |
 ------ ------------- ------------ 
| Foo   | 2019-06-01 | 2019-06-03 | 
| Bar   | 2019-06-07 | 2019-06-10 | 
| Pop   | 2019-06-09 | 2019-06-11 |          
| Bob   | 2019-06-13 |            | 
| Tom   | 2019-06-01 | 2019-06-05 |            
| Tim   | 2019-06-04 | 2019-06-05 |            
| Ben   | 2019-06-07 | 2019-06-09 |            
| Ted   | 2019-06-08 | 2019-06-09 |            
 ------ ------------ ------------- 

I need to return 2 df's, one with the count of active items within the date range (example below)

 ------------ ------- 
|    Day     |Active |
 ------------ ------- 
| 2019-06-01 |    2  |
| 2019-06-02 |    2  |
| 2019-06-03 |    2  |   
| 2019-06-04 |    2  |
| 2019-06-05 |    2  |
| 2019-06-06 |    0  |
| 2019-06-07 |    2  |
| 2019-06-08 |    3  |
| 2019-06-09 |    4  |     
| 2019-06-10 |    2  |         
| 2019-06-11 |    1  |          
| 2019-06-12 |    0  |         
| 2019-06-13 |    1  |         
| 2019-06-14 |    1  |
| 2019-06-15 |    1  |        
 ------------ ------- 

and another that returns a df with that contain active items for a given date ie 2019-06-10 returns df:

 | Bar   | 2019-06-07 | 2019-06-10 | 
 | Pop   | 2019-06-09 | 2019-06-11 |

So far I have tried to return the the second example:

active_date = pd.Timestamp('2019-06-10')

df_active = df[(df['start_date'] <= active_date) & ((df["stop_date"].isnull()) | (df["stop_date"] > active_date))]`

Any help is appreciated!

CodePudding user response:

You can do this:

df[["start_date", "stop_date"]] = df[["start_date", "stop_date"]].apply(pd.to_datetime)

df = df.ffill(axis=1)
df["days"] = [
    pd.date_range(s, e, freq="D") for s, e in zip(df["start_date"], df["stop_date"])
]

df2 = (
    df.explode("days")
    .groupby("days")["id"]
    .nunique()
    .reindex(pd.date_range(df["start_date"].min(), df["stop_date"].max()), fill_value=0)
)

Output:

2019-06-01    2
2019-06-02    2
2019-06-03    2
2019-06-04    2
2019-06-05    2
2019-06-06    0
2019-06-07    2
2019-06-08    3
2019-06-09    4
2019-06-10    2
2019-06-11    1
2019-06-12    0
2019-06-13    1
Freq: D, Name: id, dtype: int64

And, use pd.IntervalIndex:

active_date = pd.Timestamp('2019-06-10')

df[
    pd.IntervalIndex.from_arrays(df["start_date"], df["stop_date"]).contains(
        active_date
    )
].drop("days", axis=1)

Output:

    id start_date  stop_date
1  Bar 2019-06-07 2019-06-10
2  Pop 2019-06-09 2019-06-11
  • Related