I need to evaluate several million lines of performance logging for a manufacturing execution system. I need to group the data by date, class, and name and find the total "wait time" of numerous concurrently running transactions. The data comes in looking similar to what is in this dataframe:
import pandas as pd
d = {'START_DATE': ['2021-08-07 19:11:40', '2021-08-07 19:11:40', '2021-08-07 19:11:40',
'2021-08-07 19:20:40', '2021-08-07 19:20:40', '2021-08-07 19:20:40',
'2021-08-07 19:21:40', '2021-08-07 19:21:40', '2021-08-07 19:21:40',
'2021-08-10 19:20:40', '2021-08-10 19:20:40', '2021-08-10 19:20:40',
'2021-08-10 19:21:40', '2021-08-10 19:21:40', '2021-08-10 19:21:40'
],
'ELAPSED_TIME': ['00:00:00.465', '00:00:01.000', '00:00:00.165',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000'
],
'TRANSACTION': ['a', 'b', 'c',
'a', 'd', 'c',
'e', 'a', 'b',
'a', 'd', 'c',
'e', 'a', 'b'
],
'USER': ['Bob', 'Bob', 'Bob',
'Biff', 'Biff', 'Biff',
'Biff', 'Biff', 'Biff',
'Bob', 'Bob', 'Bob',
'Bob', 'Bob', 'Bob'
],
'CLASS': ['AA', 'AA', 'AA',
'BB', 'BB', 'BB',
'BB', 'BB', 'BB',
'AA', 'AA', 'AA',
'AA', 'AA', 'AA'
]}
df = pd.DataFrame(data=d)
See how the transactions start at the same time and run concurrently with each other, but are "done" at different times. E.g. Bob's first set of transactions (rows 0-2) all take a different amount of time, but when I group by DATE, CLASS, and USER I want the total wait time to be 1000ms (based on the second row, the longest of the set).
On 08/07/2021, Biff has two sets of transactions starting at different times, but they still overlap into one wait time of 360000ms (6 minutes), as worked out below.
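To make the arithmetic concrete, here is Biff's 08/07 overlap worked out by hand (timestamps taken from the dataframe above):
import pandas as pd

# Biff's 08/07 transactions form two overlapping spans:
#   set 1 starts 19:20:40; its longest ELAPSED_TIME is 00:03:00 -> ends 19:23:40
#   set 2 starts 19:21:40; its longest ELAPSED_TIME is 00:05:00 -> ends 19:26:40
# Because the spans overlap, their union runs from the earliest start to the latest end.
union_start = pd.Timestamp("2021-08-07 19:20:40")
union_end = pd.Timestamp("2021-08-07 19:26:40")
print((union_end - union_start) / pd.Timedelta("1ms"))  # 360000.0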
Expected output would look something like:
DATE        CLASS  USER    Wait
2021-08-07  AA     Bob       1000
2021-08-07  BB     Biff    360000
2021-08-10  AA     Bob     360000
Like I mentioned, the actual data has several million lines of transactions. I am looking for help in finding something better (and hopefully faster) than what I have/found:
def getSecs1(grp):
    # Build the union of 25ms-spaced timestamps across every interval in the
    # group; the size of that union counts the non-overlapping wait time in
    # 25ms ticks. (Note: union_many is deprecated in recent pandas versions.)
    return pd.DatetimeIndex([]).union_many(
        [pd.date_range(row.START_DATE, row.END_DATE, freq='25ms', closed='left')
         for _, row in grp.iterrows()]
    ).size
I add an END_DATE column by adding the elapsed time to the START_DATE. I have to sample in chunks of 25ms, otherwise it would take way too long to run; a sketch of the setup follows.
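For reference, this is roughly the preparation described above (the END_DATE construction and the final multiply-by-25 are my reading of the approach, so treat it as a sketch):
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["END_DATE"] = df["START_DATE"] + pd.to_timedelta(df["ELAPSED_TIME"])
# Each union size counts 25ms ticks, so multiplying by 25 converts back to ms.
waits = df.groupby([df["START_DATE"].dt.date, "CLASS", "USER"]).apply(getSecs1) * 25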
Any help/advice would be greatly appreciated.
Edit: changed the overlaps to minutes.
CodePudding user response:
This solution uses a package called staircase, which is built on pandas and numpy for working with (mathematical) step functions. You can think of an interval as a step function which goes from value 0 to 1 at the start of the interval and from 1 to 0 at the end of the interval.
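As a minimal illustration of that idea (assuming staircase 2.x; the one-second interval below integrates to 1000ms):
import pandas as pd
import staircase as sc

# One interval -> a step function with value 1 between start and end, 0 elsewhere.
sf = sc.Stairs(start=pd.Timestamp("2021-08-07 19:11:40"),
               end=pd.Timestamp("2021-08-07 19:11:41"))
print(sf.integral() / pd.Timedelta("1ms"))  # 1000.0 -> the interval's length in ms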
additional setup
convert START_DATE and ELAPSED_TIME to appropriate pandas time objects
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["ELAPSED_TIME"] = pd.to_timedelta(df["ELAPSED_TIME"])
define daily bins
dates = pd.period_range("2021-08-07", "2021-08-10")
solution
Define a function which takes a dataframe, makes a step function from start and end times (calculated as start + duration), sets non-zero values to 1, slices the step function with the bins, and integrates.
import staircase as sc

def calc_dates_for_user(df_):
    return (
        sc.Stairs(  # create a step function from the intervals
            start=df_["START_DATE"],
            end=df_["START_DATE"] + df_["ELAPSED_TIME"],
        )
        .make_boolean()  # where two intervals overlap the value of the step function will be 2; this sets all non-zero values to 1 (effectively creating a union of intervals)
        .slice(dates)  # analogous to groupby
        .integral() / pd.Timedelta("1ms")  # for each slice integrate (which will equal the length of the union) and divide by 1ms to express the result in milliseconds
    )
When we groupby USER and CLASS and apply this function we get a dataframe, indexed by these variables, with a column index corresponding to intervals in the period range.
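That is, the table below is the result of this call (the same groupby used in the cleanup step further down):
df.groupby(["USER", "CLASS"]).apply(calc_dates_for_user)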
USER  CLASS  [2021-08-07, 2021-08-08)  [2021-08-08, 2021-08-09)  [2021-08-09, 2021-08-10)  [2021-08-10, 2021-08-11)
Biff  BB     360000.0                  0.0                       0.0                       0.0
Bob   AA     1000.0                    0.0                       0.0                       360000.0
We'll clean it up like so:
result = (
    df.groupby(["USER", "CLASS"])
    .apply(calc_dates_for_user)
    .melt(ignore_index=False, var_name="DATE", value_name="WAIT")  # melt column index into a single column of daily intervals
    .query("WAIT != 0")  # filter out days where no time was recorded
    .reset_index()  # move USER and CLASS from index back to columns
)
result then looks like this:
   USER CLASS                      DATE      WAIT
0  Biff    BB  [2021-08-07, 2021-08-08)  360000.0
1   Bob    AA  [2021-08-07, 2021-08-08)    1000.0
2   Bob    AA  [2021-08-10, 2021-08-11)  360000.0
To get your expected result you can replace the DATE column with the timestamp of each day's start:
result["DATE"] = pd.IntervalIndex(result["DATE"]).left