I need to evaluate several million lines of performance logging for a manufacturing execution system. I need to group the data by date, class, and name and find the total "wait time" of numerous concurrently running transactions. The data comes in looking similar to what is in this dataframe:
import pandas as pd
d = {'START_DATE': ['2021-08-07 19:11:40', '2021-08-07 19:11:40', '2021-08-07 19:11:40',
'2021-08-07 19:20:40', '2021-08-07 19:20:40', '2021-08-07 19:20:40',
'2021-08-07 19:21:40', '2021-08-07 19:21:40', '2021-08-07 19:21:40',
'2021-08-10 19:20:40', '2021-08-10 19:20:40', '2021-08-10 19:20:40',
'2021-08-10 19:21:40', '2021-08-10 19:21:40', '2021-08-10 19:21:40'
],
'ELAPSED_TIME': ['00:00:00.465', '00:00:01.000', '00:00:00.165',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000'
],
'TRANSACTION': ['a', 'b', 'c',
'a', 'd', 'c',
'e', 'a', 'b',
'a', 'd', 'c',
'e', 'a', 'b'
],
'USER': ['Bob', 'Bob', 'Bob',
'Biff', 'Biff', 'Biff',
'Biff', 'Biff', 'Biff',
'Bob', 'Bob', 'Bob',
'Bob', 'Bob', 'Bob'
],
'CLASS': ['AA', 'AA', 'AA',
'BB', 'BB', 'BB',
'BB', 'BB', 'BB',
'AA', 'AA', 'AA',
'AA', 'AA', 'AA'
]}
df = pd.DataFrame(data=d)
See how the transactions start at the same time and run concurrently with each other, but are "done" at different times. E.g. Bob's first set of transactions (rows 0-2) all take a different amount of time, but when I group by DATE, CLASS, and USER I want the total wait time to be 1000ms (based on the second row, the longest of the set).
On 08/07/2021, Biff has two sets of transactions starting at different times, but they still overlap into one wait time of 360000ms (6 minutes), as worked out below.
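To make the arithmetic concrete, here is Biff's 08/07 overlap worked out by hand (timestamps taken from the dataframe above):
import pandas as pd

# Biff's 08/07 transactions form two overlapping spans:
#   set 1 starts 19:20:40; its longest ELAPSED_TIME is 00:03:00 -> ends 19:23:40
#   set 2 starts 19:21:40; its longest ELAPSED_TIME is 00:05:00 -> ends 19:26:40
# Because the spans overlap, their union runs from the earliest start to the latest end.
union_start = pd.Timestamp("2021-08-07 19:20:40")
union_end = pd.Timestamp("2021-08-07 19:26:40")
print((union_end - union_start) / pd.Timedelta("1ms"))  # 360000.0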
Expected output would look something like:
DATE        CLASS  USER    Wait
2021-08-07  AA     Bob       1000
2021-08-07  BB     Biff    360000
2021-08-10  AA     Bob     360000
Like I mentioned, the actual data has several million lines of transactions. I am looking for help in finding something better (and hopefully faster) than what I have/found:
def getSecs1(grp):
    # Build the union of 25ms-spaced timestamps across every interval in the
    # group; the size of that union counts the non-overlapping wait time in
    # 25ms ticks. (Note: union_many is deprecated in recent pandas versions.)
    return pd.DatetimeIndex([]).union_many(
        [pd.date_range(row.START_DATE, row.END_DATE, freq='25ms', closed='left')
         for _, row in grp.iterrows()]
    ).size
I add an END_DATE column by adding the elapsed time to the START_DATE. I have to sample in chunks of 25ms, otherwise it would take way too long to run; a sketch of the setup follows.
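For reference, this is roughly the preparation described above (the END_DATE construction and the final multiply-by-25 are my reading of the approach, so treat it as a sketch):
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["END_DATE"] = df["START_DATE"] + pd.to_timedelta(df["ELAPSED_TIME"])
# Each union size counts 25ms ticks, so multiplying by 25 converts back to ms.
waits = df.groupby([df["START_DATE"].dt.date, "CLASS", "USER"]).apply(getSecs1) * 25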
Any help/advice would be greatly appreciated.
Edit: changed the overlaps to minutes.
CodePudding user response:
This solution uses a package called staircase, which is built on pandas and numpy for working with (mathematical) step functions. You can think of an interval as a step function which goes from value 0 to 1 at the start of the interval and from 1 to 0 at the end of the interval.
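As a minimal illustration of that idea (assuming staircase 2.x; the one-second interval below integrates to 1000ms):
import pandas as pd
import staircase as sc

# One interval -> a step function with value 1 between start and end, 0 elsewhere.
sf = sc.Stairs(start=pd.Timestamp("2021-08-07 19:11:40"),
               end=pd.Timestamp("2021-08-07 19:11:41"))
print(sf.integral() / pd.Timedelta("1ms"))  # 1000.0 -> the interval's length in ms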
additional setup
convert START_DATE and ELAPSED_TIME to appropriate pandas time objects
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["ELAPSED_TIME"] = pd.to_timedelta(df["ELAPSED_TIME"])
define daily bins
dates = pd.period_range("2021-08-07", "2021-08-10")
solution
Define a function which takes a dataframe, makes a step function from start and end times (calculated as start + duration), sets non-zero values to 1, slices the step function with the bins, and integrates.
import staircase as sc

def calc_dates_for_user(df_):
    return (
        sc.Stairs(  # create a step function from the intervals
            start=df_["START_DATE"],
            end=df_["START_DATE"] + df_["ELAPSED_TIME"],
        )
        .make_boolean()  # where two intervals overlap the value of the step function will be 2; this sets all non-zero values to 1 (effectively creating a union of intervals)
        .slice(dates)  # analogous to groupby
        .integral() / pd.Timedelta("1ms")  # for each slice integrate (which will equal the length of the union) and divide by 1ms to express the result in milliseconds
    )
When we groupby USER and CLASS and apply this function we get a dataframe, indexed by these variables, with a column index corresponding to intervals in the period range.
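That is, the table below is the result of this call (the same groupby used in the cleanup step further down):
df.groupby(["USER", "CLASS"]).apply(calc_dates_for_user)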
USER  CLASS  [2021-08-07, 2021-08-08)  [2021-08-08, 2021-08-09)  [2021-08-09, 2021-08-10)  [2021-08-10, 2021-08-11)
Biff  BB     360000.0                  0.0                       0.0                       0.0
Bob   AA     1000.0                    0.0                       0.0                       360000.0
We'll clean it up like so:
result = (
    df.groupby(["USER", "CLASS"])
    .apply(calc_dates_for_user)
    .melt(ignore_index=False, var_name="DATE", value_name="WAIT")  # melt column index into a single column of daily intervals
    .query("WAIT != 0")  # filter out days where no time was recorded
    .reset_index()  # move USER and CLASS from index back to columns
)
result then looks like this:
   USER CLASS                      DATE      WAIT
0  Biff    BB  [2021-08-07, 2021-08-08)  360000.0
1   Bob    AA  [2021-08-07, 2021-08-08)    1000.0
2   Bob    AA  [2021-08-10, 2021-08-11)  360000.0
To get your expected result you can replace the DATE column with the timestamp of each day's start:
result["DATE"] = pd.IntervalIndex(result["DATE"]).left