Speeding up a pd.apply() function


I have some example data

    import numpy as np
    import pandas as pd

    users = 5
    size = users*6

    df = pd.DataFrame(
        {'userid': np.random.choice(np.arange(0, users), size), 
        'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size), 
        'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size), 
        }
    )
    df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')

    test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)

For each userid, I need to count how many times a_time > b_time or b_time > a_time, with the direction of the comparison depending on the focus column.
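
For concreteness, the per-group logic can be illustrated on a toy group whose focus is 'a' (an illustration added here for clarity, not part of the original data):

    # Toy group for one userid with focus == 'a':
    # changes counts the rows where b_time > a_time; days is the row count.
    g = pd.DataFrame({'a_time': [1.0, 2.0, 1.5], 'b_time': [1.2, 1.8, 1.6]})
    changes = (g.b_time > g.a_time).sum()   # 2 (rows 0 and 2)
    days = len(g)                           # 3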

I have a custom function

    def some_func(x):
        if (x.focus == 'a').all():
            # focus 'a': count the rows where b_time beats a_time
            a = x.a_time
            b = x.b_time
            x['changes'] = (b > a).sum()
            x['days'] = len(a)
        elif (x.focus == 'b').all():
            # focus 'b': count the rows where a_time beats b_time
            a = x.a_time
            b = x.b_time
            x['changes'] = (a > b).sum()
            x['days'] = len(a)
        elif (x.focus == 'both').all():
            x['changes'] = 0
            x['days'] = len(x)
        else:
            x['changes'] = None
            x['days'] = None

        return x

    test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index(drop = True)

This works just fine when the number of unique userid values is small. However, as the number of unique userid values grows past 100K, the function becomes almost unbearably slow.

Is there a way to speed this function up? My guess is that there is an alternative to the if-elif-else logic in some_func(), but I'm not sure what that alternative might be. The number of rows for each userid is arbitrarily large.

I'm open to non-pandas options if necessary.

CodePudding user response:

A possible solution that, for users = 110000, produces the result almost instantaneously. The idea is to create an auxiliary boolean column with the comparison a_time > b_time and then use groupby with sum and size. A variant that flips the comparison direction according to focus is sketched after the output below.

    (df.assign(aux = df.a_time.gt(df.b_time))
     .groupby(['userid', 'focus'])['aux']
     .agg(['sum', 'size']).reset_index()
     .rename(columns = {'sum': 'changes', 'size': 'days'}))

Output:

        userid focus  changes  days
0            0     a        6    11
1            1     b        4     5
2            2     a        5     6
3            3     b        1     7
4            4     a        3     6
...        ...   ...      ...   ...
109732  109995     b        1     5
109733  109996     a        2     6
109734  109997     b        3     3
109735  109998     a        4     8
109736  109999     b        6     8

[109737 rows x 4 columns]
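
The aux column above always tests a_time > b_time, whereas the original some_func flips the direction of the comparison depending on focus. If that distinction matters, one possible adjustment (a sketch building on this answer, not part of the original) is to build the auxiliary column with np.where:

    # Sketch: make the comparison direction follow the focus column,
    # mirroring the branches of the original some_func.
    aux = np.where(df.focus.eq('a'),
                   df.b_time.gt(df.a_time),   # focus 'a': count b_time > a_time
                   df.a_time.gt(df.b_time))   # otherwise: count a_time > b_time
    (df.assign(aux = aux)
     .groupby(['userid', 'focus'])['aux']
     .agg(['sum', 'size']).reset_index()
     .rename(columns = {'sum': 'changes', 'size': 'days'}))

Groups with focus == 'both' (or any other value) would need an extra condition, for example via np.select.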

CodePudding user response:

It looks like the (x.focus == ...).all() checks in some_func are part of what makes it slow when the number of unique userids is large. Because the data is grouped on ['userid', 'focus'], every row in a group has the same focus value, so the function only needs to inspect the first value with x.focus.iloc[0] instead of scanning the entire column. Here is an example:

    def some_func(x):
        # All rows in a group share the same focus, so checking the first value is enough
        if x.focus.iloc[0] == 'a':
            a = x.a_time
            b = x.b_time
            changes = (b > a).sum()
            days = len(a)
        elif x.focus.iloc[0] == 'b':
            a = x.a_time
            b = x.b_time
            changes = (a > b).sum()
            days = len(a)
        elif x.focus.iloc[0] == 'both':
            changes = 0
            days = len(x)
        else:
            changes = None
            days = None

        return pd.Series({'changes': changes, 'days': days})

Note that this function returns a Series object instead of modifying the input DataFrame. Also, it checks only the first value of focus rather than calling the .all() method on the whole column, which can be slow when a group has many rows.

You can then apply this function to your data like this:

    test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index()

This should make the some_func function run much faster when the number of unique userids is large.
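
If you want to quantify the difference on your own machine, a rough timing harness like the one below can help. It is only a sketch for illustration: it times the groupby().apply() route from this answer against the vectorized assign/groupby/agg route from the earlier answer, reusing the test_dat frame from the question.

    import time

    # Time the per-group apply() version
    start = time.perf_counter()
    res_apply = test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index()
    print('apply version:     ', time.perf_counter() - start, 'seconds')

    # Time the vectorized version from the earlier answer
    start = time.perf_counter()
    res_vec = (test_dat.assign(aux = test_dat.a_time.gt(test_dat.b_time))
               .groupby(['userid', 'focus'])['aux']
               .agg(['sum', 'size']).reset_index()
               .rename(columns = {'sum': 'changes', 'size': 'days'}))
    print('vectorized version:', time.perf_counter() - start, 'seconds')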
