Cumulative sum that updates between two date ranges-CodePudding

I have data that looks like this: (assume start and end are date times)

id	start	end
1	01-01	01-02
1	01-03	01-05
1	01-04	01-07
1	01-06	NaT
1	01-07	NaT

I want to get a data frame that would include all dates, that has a 'cumulative sum' that only counts for the range they are in.

dates	count
01-01	1
01-02	0
01-03	1
01-04	2
01-05	1
01-06	2
01-07	3

One idea I thought of was simply using cumcount on the start dates, and doing a 'reverse cumcount' decreasing the counts using the end dates, but I am having trouble wrapping my head around doing this in pandas and I'm wondering whether there's a more elegant solution.

CodePudding user response：

Here is two options. first consider this data with only one id, note that your columns start and end must be datetime.

d = {'id': [1, 1, 1, 1, 1],
     'start': [pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-03'),
               pd.Timestamp('2021-01-04'), pd.Timestamp('2021-01-06'),
               pd.Timestamp('2021-01-07')],
     'end': [pd.Timestamp('2021-01-02'), pd.Timestamp('2021-01-05'),
             pd.Timestamp('2021-01-07'), pd.NaT, pd.NaT]}
df = pd.DataFrame(d)

so to get your result, you can do a sub between the get_dummies of start and end. then sum if several start and or end at the same dates, cumsum along the dates, reindex to get all the dates between the min and max dates available. create a function.

def dates_cc(df_):
    return (
        pd.get_dummies(df_['start'])
          .sub(pd.get_dummies(df_['end'], dtype=int), fill_value=0)
          .sum()
          .cumsum()
          .to_frame(name='count')
          .reindex(pd.date_range(df_['start'].min(), df_['end'].max()), method='ffill')
          .rename_axis('dates')
    )

Now you can apply this function to your dataframe

res = dates_cc(df).reset_index()
print(res)
#        dates  count
# 0 2021-01-01    1.0
# 1 2021-01-02    0.0
# 2 2021-01-03    1.0
# 3 2021-01-04    2.0
# 4 2021-01-05    1.0
# 5 2021-01-06    2.0
# 6 2021-01-07    2.0

Now if you have several id, like

df1 = df.assign(id=[1,1,2,2,2])
print(df1)
#    id      start        end
# 0   1 2021-01-01 2021-01-02
# 1   1 2021-01-03 2021-01-05
# 2   2 2021-01-04 2021-01-07
# 3   2 2021-01-06        NaT
# 4   2 2021-01-07        NaT

then you can use the above function like

res1 = df1.groupby('id').apply(dates_cc).reset_index()
print(res1)
#    id      dates  count
# 0   1 2021-01-01    1.0
# 1   1 2021-01-02    0.0
# 2   1 2021-01-03    1.0
# 3   1 2021-01-04    1.0
# 4   1 2021-01-05    0.0
# 5   2 2021-01-04    1.0
# 6   2 2021-01-05    1.0
# 7   2 2021-01-06    2.0
# 8   2 2021-01-07    2.0

that said, a more straightforward possibility is with crosstab that create a row per id, the rest is about the same manipulations.

res2 = (
    pd.crosstab(index=df1['id'], columns=df1['start'])
      .sub(pd.crosstab(index=df1['id'], columns=df1['end']), fill_value=0)
      .reindex(columns=pd.date_range(df1['start'].min(), df1['end'].max()), fill_value=0)
      .rename_axis(columns='dates')
      .cumsum(axis=1)
      .stack()
      .reset_index(name='count')
)
print(res2)
#     id      dates  count
# 0    1 2021-01-01    1.0
# 1    1 2021-01-02    0.0
# 2    1 2021-01-03    1.0
# 3    1 2021-01-04    1.0
# 4    1 2021-01-05    0.0
# 5    1 2021-01-06    0.0
# 6    1 2021-01-07    0.0
# 7    2 2021-01-01    0.0
# 8    2 2021-01-02    0.0
# 9    2 2021-01-03    0.0
# 10   2 2021-01-04    1.0
# 11   2 2021-01-05    1.0
# 12   2 2021-01-06    2.0
# 13   2 2021-01-07    2.0

the main difference between the two options is that this one create extra dates for each id, because for example 2021-01-01 is in id=1 but not id=2 and with this version, you get this date also for id=2 while in groupby it is not taken into account.