Having the following DF of user events:
    id                timestamp
0    1  2021-11-23 11:01:00.000
1    1  2021-11-23 11:02:00.000
2    1  2021-11-23 11:10:00.000
3    1  2021-11-23 11:11:00.000
4    1  2021-11-23 11:22:00.000
5    1  2021-11-23 11:40:00.000
6    1  2021-11-23 11:41:00.000
7    1  2021-11-23 11:42:00.000
8    1  2021-11-23 11:43:00.000
9    1  2021-11-23 11:44:00.000
10   2  2021-11-23 11:01:00.000
11   2  2021-11-23 11:02:00.000
12   2  2021-11-23 11:10:00.000
13   2  2021-11-23 11:11:00.000
14   2  2021-11-23 11:22:00.000
15   2  2021-11-23 11:40:00.000
16   2  2021-11-23 11:41:00.000
17   2  2021-11-23 11:42:00.000
18   2  2021-11-23 11:43:00.000
19   2  2021-11-23 11:44:00.000
I calculate the average session time per row as follows:
- Each session is a sequence of events less than 5 minutes apart.
- I calculate the number of seconds between the first event in the session and the current event.
- I then calculate expanding().mean() for each user.
Here is my code:
def average_session_time(**kwargs):
    df = kwargs['df'].copy()
    df['timestamp'] = pd.to_datetime(df.timestamp)
    df['session_grp'] = df.groupby('id').apply(
        lambda x: x.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start')).ngroup()
    ).reset_index(drop=True).values.reshape(-1)
    # Getting relevant 5min groups
    ng = df.groupby(['id', 'session_grp'])
    df['fts'] = ng['timestamp'].transform('first')
    df['delta'] = df['timestamp'].sub(df['fts']).dt.total_seconds()
    return df.groupby('id')['delta'].expanding().mean().reset_index(drop=True)
And the output is:
0 0.000000
1 30.000000
2 20.000000
3 15.000000
4 12.000000
5 10.000000
6 8.571429
7 15.000000
8 26.666667
9 42.000000
10 0.000000
11 30.000000
12 20.000000
13 15.000000
14 12.000000
15 10.000000
16 8.571429
17 15.000000
18 26.666667
19 42.000000
Name: delta, dtype: float64
The code works fine, but when it's run on a large data set, the performance suffers and it takes a long time to calculate. I tried tweaking the code, but couldn't gain more performance. How can I write this function differently to improve performance?
Here is a Colab with the running code.
CodePudding user response:
I think I was able to cut your time in half by passing format to pd.to_datetime and by using the as_index parameter in groupby instead of calling reset_index:
def average_session_time(**kwargs):
    df = kwargs['df'].copy()
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S.%f')
    grp_id = df.groupby('id', as_index=False)
    df['session_grp'] = grp_id.apply(
        lambda x: x.groupby(pd.Grouper(key="timestamp", freq='5min', origin='start')).ngroup()
    ).values.reshape(-1)
    # Getting relevant 5min groups
    ng = df.groupby(['id', 'session_grp'])
    df['fts'] = ng['timestamp'].transform('first')
    df['delta'] = df['timestamp'].sub(df['fts']).dt.total_seconds()
    return grp_id['delta'].expanding().mean().reset_index(level=0, drop=True)
Original Timing:
40.228641986846924
New Timing:
16.08320665359497
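The data and timing code behind these numbers are not shown; a rough sketch of how such a comparison could be measured, using a hypothetical synthetic frame, might look like this:

import time

import numpy as np
import pandas as pd

# Hypothetical benchmark data: one million string timestamps across 10,000 users.
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'id': rng.integers(0, 10_000, size=n),
    'timestamp': pd.Timestamp('2021-11-23') + pd.to_timedelta(rng.integers(0, 86_400, size=n), unit='s'),
})
big = big.sort_values(['id', 'timestamp'], ignore_index=True)
big['timestamp'] = big['timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')  # store as strings, as in the question

start = time.time()
average_session_time(df=big)  # time either version of the function here
print(time.time() - start)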
CodePudding user response:
One very fast solution is to use NumPy and Numba together to group the consecutive rows the way you do.
First of all, the columns need to be converted to native NumPy types, since CPython objects are very slow to process (and also take more memory). You can do that with:
ids = df['id'].values.astype('S32')
timestamps = df['timestamp'].values.astype('datetime64[ns]')
This assumes that the IDs are composed of up to 32 ASCII characters. If the IDs can contain Unicode characters, you can use 'U32' instead (a bit slower). You can also use np.unicode_ to let NumPy find the bound for you, but this is significantly slower (since NumPy needs to parse all the strings twice). Once converted to datetime64[ns], the timestamps can be converted to 64-bit integers for a very fast computation by Numba.
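For example, mirroring the conversion used in the complete solution below:

# Interpret the datetime64[ns] values as plain 64-bit nanosecond counts,
# which Numba can loop over directly.
timestamps = df['timestamp'].values.astype('datetime64[ns]').astype(np.int64)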
Then, the idea is to transform the string IDs into basic integers, since working with strings is pretty slow. You can do that by searching for adjacent strings that differ, so as to locate the blocks of rows sharing the same ID:
ids_int = np.insert(np.cumsum(ids[1:] != ids[:-1], dtype=np.int64), 0, 0)
Note that in the provided dataset there is no set of rows sharing the same ID with a row having a different ID in between (i.e. the rows of each ID are contiguous). If this assumption does not always hold, you can sort the input strings (ids) with np.argsort(ids, kind='stable'), apply this solution, and then reorder the result based on the output of np.argsort (see the sketch below). Note that sorting the strings is a bit slow, but still much faster than the computation time of the solution provided in the question (about 100-200 ms on my machine).
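A minimal sketch of that fallback (assuming ids is the S32 array from above, timestamps has already been converted to int64 nanoseconds, and compute_result is the Numba function defined below) could look like this:

# Stable sort by ID so that rows of each ID become contiguous
# (timestamps stay ordered within each ID because the sort is stable).
order = np.argsort(ids, kind='stable')
sorted_ids = ids[order]
sorted_timestamps = timestamps[order]

# Same integer re-labelling as before, but on the sorted IDs.
ids_int = np.insert(np.cumsum(sorted_ids[1:] != sorted_ids[:-1], dtype=np.int64), 0, 0)
sorted_result = compute_result(ids_int, sorted_timestamps)

# Scatter the results back into the original row order.
result = np.empty_like(sorted_result)
result[order] = sorted_result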
Finally, you can compute the result with Numba using basic loops.
Complete solution
Here is the resulting code:
import pandas as pd
import numpy as np
import numba as nb

@nb.njit('float64[:](int64[::1], int64[::1])')
def compute_result(ids, timestamps):
    n = len(ids)
    result = np.empty(n, dtype=np.float64)
    if n == 0:
        return result

    id_group_first_timestamp = timestamps[0]       # first timestamp of the current ID group
    session_group_first_timestamp = timestamps[0]  # first timestamp of the current 5-minute session group
    id_group_count = 1                             # rows seen so far in the current ID group
    id_group_delta_sum = 0.0                       # running sum of deltas in the current ID group
    last_session_group = 0
    result[0] = 0
    delay = np.int64(300e9)                        # 5 min (in ns)

    for i in range(1, n):
        # If there is a new group of IDs, reset the running state
        if ids[i-1] != ids[i]:
            id_group_first_timestamp = timestamps[i]
            id_group_delta_sum = 0.0
            id_group_count = 1
            last_session_group = 0
            session_group_first_timestamp = timestamps[i]
            result[i] = 0
        else:
            id_group_count += 1
            # 5-minute bin of the current row, relative to the start of the ID group
            session_group = (timestamps[i] - id_group_first_timestamp) // delay

            # If there is a new session group, it starts at the current timestamp
            if session_group != last_session_group:
                session_group_first_timestamp = timestamps[i]

            # Seconds elapsed since the start of the current session group
            delta = (timestamps[i] - session_group_first_timestamp) * 1e-9
            id_group_delta_sum += delta
            result[i] = id_group_delta_sum / id_group_count
            last_session_group = session_group

    return result

def fast_average_session_time(df):
    # Convert to native NumPy types: byte strings for the IDs,
    # int64 nanosecond counts for the timestamps
    ids = df['id'].values.astype('S32')
    timestamps = df['timestamp'].values.astype('datetime64[ns]').astype(np.int64)
    # Replace the string IDs by integers that increase each time the ID changes
    ids_int = np.insert(np.cumsum(ids[1:] != ids[:-1], dtype=np.int64), 0, 0)
    return compute_result(ids_int, timestamps)
Note that the output is a NumPy array and not a DataFrame, but you can easily build one from the array with pd.DataFrame(result).
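For example, a minimal usage sketch (wrapping the result in a Series here only to mirror the shape of the original function's output; the name 'delta' is kept for consistency):

# Assuming `df` is the question's DataFrame with 'id' and 'timestamp' columns.
result = fast_average_session_time(df)
out = pd.Series(result, index=df.index, name='delta')
print(out)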
Benchmark
Here is the resulting performance on my machine:
Initial solution:         12_537 ms  ( x1.0)
Scott Boston's solution:   7_431 ms  ( x1.7)
This solution:                64 ms  ( x195.9)

Time taken by compute_result only: 2.5 ms
Thus, this solution is nearly 200 times faster than the initial one.
Note that about 85% of the time is spent in Unicode/datetime string parsing, which can hardly be optimized. Indeed, this processing is slow because dealing with short Unicode strings is inherently expensive on modern processors, and because CPython objects introduce a significant overhead (e.g. reference counting). Moreover, this processing cannot be parallelized efficiently because of the CPython GIL and slow inter-process communication. Thus, this code is certainly close to optimal (as long as you use CPython).