Home > Enterprise >  Create a time series that sums data on each day D, if D is between the start date and the end date
Create a time series that sums data on each day D, if D is between the start date and the end date

Time:11-27

My raw data is a dataframe with three columns that describe journeys: quantity, start date, end date. My goal is to create a new dataframe with a daily index and one single column that shows the sum of the quantities of the journeys that were "on the way" each day i.e. sum quantity if day > start date and day < end date. I think I can achieve this by creating a daily index and then using a for loop that on each day uses a mask to filter the data, then sums. I haven't managed to make it work but I think that there might actually be a better approach? Below is my attempt with some dummy data...

data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)

index2 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df2 = pd.DataFrame(0,index2,'quantities')
for t in index2:
    mask = (df['start']<t) & (df['end']>t)
    df2['quantities'] = df[mask]['quantity'].sum()

CodePudding user response:

Maybe you could create date range for each record, then explode and groupby:

data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)

df['range'] = df.apply(lambda x: pd.date_range(x['startdate'],x['enddate'],freq='D'), axis=1)
df = df.explode('range')
df.groupby('range')['quantity'].sum()

CodePudding user response:

Your data describes a step function, ie on the 2nd of March (midnight) it increases by a value of 10, and on the 27th of March (midnight) it decreases by 10.

This solution uses a package called enter image description here

If you want to just get the value at the start of each day then you can sample the step function like this

sf(dates, include_index=True)

The result will be a pandas.Series indexed by your date range

2020-01-01     0
2020-01-02    33
2020-01-03    33
2020-01-04    33
2020-01-05    33
            ..
2020-05-28     0
2020-05-29     0
2020-05-30     0
2020-05-31     0
2020-06-01     0
Freq: D, Length: 153, dtype: int64

A more general solution to your problem which includes start and end times at any datetime (not just midnight) and arbitrary bins can be achieved with slicing and integrating.

  • Related