Aggregate interval data by binning to create time series-CodePudding

I have a DataFrame containing records on the start time, end time, user, product and usage rate. A user is capable of using the same product multiple times at different rates. I want to transform this DataFrame into a timeseries dataset where for each user, bins of 4 hours are created in which the usage rate is added up. As start and end times can overlap with the bin boundaries, fractional usage over that bin would need to be calculated and each product receives their own column.

Original DataFrame:

start	end	user	product	usage_rate
1	12	A	X_1	10
8	15	A	X_1	20
3	7	B	X_1	3
3	8	B	X_2	70

Desired output:

user	bin	start	end	X_1	X_2
A	0-4	0	4	30	0
A	4-8	4	8	40	0
A	8-12	8	12	120	0
A	12-16	12	16	60	0
B	0-4	0	4	3	70
B	4-8	4	8	9	280

I've tried to use pd.cut to create bins, but only manage to create only DataFrames where the start time matches and/or the end time matches, but not the bins in between. I want to prevent using for loops as these can become quite slow when expanding this to 10k users and 20 products. The process I thought of first duplicates each row for the number of time bins required (0-4, 4-8, 8-12 for the first record), calculates the duration within that bin, multiplies duration by usage_rate to get used_in_bin value, then groupby(['user', 'bin']) and retrieves the sum of those used_in_bin values. Then the last step would be to pivot the product name and values to get the table.

The hard part is that I'm stuck creating the time bins for time ranges (either datetime or numerical). In what way can I create duplicate records for the missing bins between the start and end time of my range?

CodePudding user response：

The data you have is described by mathematical step functions, and

What you are wanting to do is create bins and integrate (find the area under) these step functions. For a step function sf this means the following calculation

sf.slice([0,4,8,12,16]).integral()

We can use pandas.Series.apply to do this for all our step functions.

binned = stepfunctions.apply(lambda sf: sf.slice([0,4,8,12,16]).integral())

The binned variable will we a dataframe, with the same index as stepfunctions, with a column for each bin interval

               [0, 4)  [4, 8)  [8, 12)  [12, 16)
user product                                   
A    X_1        30.0    40.0    120.0      60.0
B    X_1         3.0     9.0      0.0       0.0
     X_2        70.0   280.0      0.0       0.0

To get this data in a tidy format the following can be used

tidy_result = binned.melt(ignore_index=False).rename({"variable":"bin"}).reset_index()

The tidy_result dataframe will look like this:

   user product       bin  value
0     A     X_1    [0, 4)   30.0
1     B     X_1    [0, 4)    3.0
2     B     X_2    [0, 4)   70.0
3     A     X_1    [4, 8)   40.0
4     B     X_1    [4, 8)    9.0
5     B     X_2    [4, 8)  280.0
6     A     X_1   [8, 12)  120.0
7     B     X_1   [8, 12)    0.0
8     B     X_2   [8, 12)    0.0
9     A     X_1  [12, 16)   60.0
10    B     X_1  [12, 16)    0.0
11    B     X_2  [12, 16)    0.0

This may be enough for your purposes. If you want to get it in the exact format you have submitted, then this should be easy enough:

use tidy_result.pivot(index=["user", "bin"], columns="product") to turn the product column into columns for each product. Fill NA values for A and X_2 with 0.
use tidy_result["start"] = pd.IntervalIndex(tidy_result["bin"]).left etc

To recap, the solution (after imports) boils down to the following three lines

stepfunctions = df.groupby(["user", "product"]).apply(sc.Stairs, "start", "end", "usage_rate")
binned = stepfunctions.apply(lambda sf: sf.slice([0,4,8,12,16]).integral())
tidy_result = binned.melt(ignore_index=False).rename({"variable":"bin"}).reset_index()