My raw data is a dataframe with three columns that describe journeys: quantity, start date, end date. My goal is to create a new dataframe with a daily index and one single column that shows the sum of the quantities of the journeys that were "on the way" each day i.e. sum quantity if day > start date and day < end date. I think I can achieve this by creating a daily index and then using a for loop that on each day uses a mask to filter the data, then sums. I haven't managed to make it work but I think that there might actually be a better approach? Below is my attempt with some dummy data...
data = [[10, '2020-03-02', '2020-03-27'],
[18, '2020-03-06', '2020-03-10'],
[21, '2020-03-20', '2020-05-02'],
[33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
index2 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df2 = pd.DataFrame(0,index2,'quantities')
for t in index2:
mask = (df['start']<t) & (df['end']>t)
df2['quantities'] = df[mask]['quantity'].sum()
CodePudding user response:
Maybe you could create date range for each record, then explode and groupby:
data = [[10, '2020-03-02', '2020-03-27'],
[18, '2020-03-06', '2020-03-10'],
[21, '2020-03-20', '2020-05-02'],
[33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
df['range'] = df.apply(lambda x: pd.date_range(x['startdate'],x['enddate'],freq='D'), axis=1)
df = df.explode('range')
df.groupby('range')['quantity'].sum()
CodePudding user response:
Your data describes a step function, ie on the 2nd of March (midnight) it increases by a value of 10, and on the 27th of March (midnight) it decreases by 10.
This solution uses a package called
If you want to just get the value at the start of each day then you can sample the step function like this
sf(dates, include_index=True)
The result will be a pandas.Series
indexed by your date range
2020-01-01 0
2020-01-02 33
2020-01-03 33
2020-01-04 33
2020-01-05 33
..
2020-05-28 0
2020-05-29 0
2020-05-30 0
2020-05-31 0
2020-06-01 0
Freq: D, Length: 153, dtype: int64
A more general solution to your problem which includes start and end times at any datetime (not just midnight) and arbitrary bins can be achieved with slicing and integrating.