Create a time series that sums data on each day D, if D is between the start date and the end date-CodePudding

My raw data is a dataframe with three columns that describe journeys: quantity, start date, end date. My goal is to create a new dataframe with a daily index and one single column that shows the sum of the quantities of the journeys that were "on the way" each day i.e. sum quantity if day > start date and day < end date. I think I can achieve this by creating a daily index and then using a for loop that on each day uses a mask to filter the data, then sums. I haven't managed to make it work but I think that there might actually be a better approach? Below is my attempt with some dummy data...

data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)

index2 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df2 = pd.DataFrame(0,index2,'quantities')
for t in index2:
    mask = (df['start']<t) & (df['end']>t)
    df2['quantities'] = df[mask]['quantity'].sum()

CodePudding user response：

Maybe you could create date range for each record, then explode and groupby:

data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)

df['range'] = df.apply(lambda x: pd.date_range(x['startdate'],x['enddate'],freq='D'), axis=1)
df = df.explode('range')
df.groupby('range')['quantity'].sum()

CodePudding user response：

Your data describes a step function, ie on the 2nd of March (midnight) it increases by a value of 10, and on the 27th of March (midnight) it decreases by 10.

This solution uses a package called

If you want to just get the value at the start of each day then you can sample the step function like this

sf(dates, include_index=True)

The result will be a pandas.Series indexed by your date range

2020-01-01     0
2020-01-02    33
2020-01-03    33
2020-01-04    33
2020-01-05    33
            ..
2020-05-28     0
2020-05-29     0
2020-05-30     0
2020-05-31     0
2020-06-01     0
Freq: D, Length: 153, dtype: int64

A more general solution to your problem which includes start and end times at any datetime (not just midnight) and arbitrary bins can be achieved with slicing and integrating.