I have a DataFrame containing records on the start time, end time, user, product and usage rate. A user is capable of using the same product multiple times at different rates. I want to transform this DataFrame into a timeseries dataset where for each user, bins of 4 hours are created in which the usage rate is added up. As start and end times can overlap with the bin boundaries, fractional usage over that bin would need to be calculated and each product receives their own column.
Original DataFrame:
start | end | user | product | usage_rate |
---|---|---|---|---|
1 | 12 | A | X_1 | 10 |
8 | 15 | A | X_1 | 20 |
3 | 7 | B | X_1 | 3 |
3 | 8 | B | X_2 | 70 |
Desired output:
user | bin | start | end | X_1 | X_2 |
---|---|---|---|---|---|
A | 0-4 | 0 | 4 | 30 | 0 |
A | 4-8 | 4 | 8 | 40 | 0 |
A | 8-12 | 8 | 12 | 120 | 0 |
A | 12-16 | 12 | 16 | 60 | 0 |
B | 0-4 | 0 | 4 | 3 | 70 |
B | 4-8 | 4 | 8 | 9 | 280 |
I've tried to use pd.cut
to create bins, but only manage to create only DataFrames where the start time matches and/or the end time matches, but not the bins in between. I want to prevent using for loops as these can become quite slow when expanding this to 10k users and 20 products. The process I thought of first duplicates each row for the number of time bins required (0-4, 4-8, 8-12 for the first record), calculates the duration within that bin, multiplies duration by usage_rate
to get used_in_bin
value, then groupby(['user', 'bin'])
and retrieves the sum of those used_in_bin
values. Then the last step would be to pivot the product name and values to get the table.
The hard part is that I'm stuck creating the time bins for time ranges (either datetime or numerical). In what way can I create duplicate records for the missing bins between the start and end time of my range?
CodePudding user response:
The data you have is described by mathematical step functions, and
What you are wanting to do is create bins and integrate (find the area under) these step functions. For a step function sf
this means the following calculation
sf.slice([0,4,8,12,16]).integral()
We can use pandas.Series.apply
to do this for all our step functions.
binned = stepfunctions.apply(lambda sf: sf.slice([0,4,8,12,16]).integral())
The binned
variable will we a dataframe, with the same index as stepfunctions
, with a column for each bin interval
[0, 4) [4, 8) [8, 12) [12, 16)
user product
A X_1 30.0 40.0 120.0 60.0
B X_1 3.0 9.0 0.0 0.0
X_2 70.0 280.0 0.0 0.0
To get this data in a tidy format the following can be used
tidy_result = binned.melt(ignore_index=False).rename({"variable":"bin"}).reset_index()
The tidy_result
dataframe will look like this:
user product bin value
0 A X_1 [0, 4) 30.0
1 B X_1 [0, 4) 3.0
2 B X_2 [0, 4) 70.0
3 A X_1 [4, 8) 40.0
4 B X_1 [4, 8) 9.0
5 B X_2 [4, 8) 280.0
6 A X_1 [8, 12) 120.0
7 B X_1 [8, 12) 0.0
8 B X_2 [8, 12) 0.0
9 A X_1 [12, 16) 60.0
10 B X_1 [12, 16) 0.0
11 B X_2 [12, 16) 0.0
This may be enough for your purposes. If you want to get it in the exact format you have submitted, then this should be easy enough:
- use
tidy_result.pivot(index=["user", "bin"], columns="product")
to turn the product column into columns for each product. FillNA
values forA
andX_2
with 0. - use
tidy_result["start"] = pd.IntervalIndex(tidy_result["bin"]).left
etc
To recap, the solution (after imports) boils down to the following three lines
stepfunctions = df.groupby(["user", "product"]).apply(sc.Stairs, "start", "end", "usage_rate")
binned = stepfunctions.apply(lambda sf: sf.slice([0,4,8,12,16]).integral())
tidy_result = binned.melt(ignore_index=False).rename({"variable":"bin"}).reset_index()