I have some example data
import numpy as np
import pandas as pd
users = 5
size = users*6
df = pd.DataFrame(
    {'userid': np.random.choice(np.arange(0, users), size),
     'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
     'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
     }
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)
For each userid, I need to determine how many times a_time > b_time or vice versa, depending on the column focus.
I have a custom function
def some_func(x):
    if (x.focus == 'a').all():
        a = x.a_time
        b = x.b_time
        x['changes'] = (b > a).sum()
        x['days'] = len(a)
    elif (x.focus == 'b').all():
        a = x.a_time
        b = x.b_time
        x['changes'] = (a > b).sum()
        x['days'] = len(a)
    elif (x.focus == 'both').all():
        x['changes'] = 0
        x['days'] = len(x)
    else:
        x['changes'] = None
        x['days'] = None
    return x
test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index(name = 'n_changes')
that works just fine when the number of userid values is small. However, as the number of unique userid values increases to >100K, this function is almost unbearably slow.
Is there a way to speed up this function? My guess is that there might be an alternative to the if-elif-else syntax in some_func(), but I'm not sure what that syntax might be. The number of rows for each userid can be arbitrarily large.
I'm open to non-pandas options if necessary.
CodePudding user response:
A possible solution that, for users = 110000, produces the result almost instantaneously. The idea is to create an auxiliary boolean column with the comparison a_time > b_time and then use groupby and sum.
(df.assign(aux = df.a_time.gt(df.b_time)).groupby(['userid', 'focus'])['aux']
   .agg([sum, np.size]).reset_index()
   .rename(columns={'sum': 'changes', 'size': 'days'}))
Output:
userid focus changes days
0 0 a 6 11
1 1 b 4 5
2 2 a 5 6
3 3 b 1 7
4 4 a 3 6
... ... ... ... ...
109732 109995 b 1 5
109733 109996 a 2 6
109734 109997 b 3 3
109735 109998 a 4 8
109736 109999 b 6 8
[109737 rows x 4 columns]
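As an aside, on recent pandas versions passing the built-in sum to agg may raise a deprecation warning. An equivalent spelling of the same idea using named aggregation (a sketch, assuming the same df as above) would be:
# Same idea, but naming the output columns directly instead of renaming afterwards
(df.assign(aux = df.a_time.gt(df.b_time))
   .groupby(['userid', 'focus'])['aux']
   .agg(changes = 'sum', days = 'size')
   .reset_index())
This produces the same changes and days columns without the rename step.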
CodePudding user response:
It looks like the per-group checks in the some_func function are causing it to be slow when the number of unique userids is large. One way to make this function faster is to avoid checking every row of each group with all and instead inspect only the first value of focus in the group, returning just the two computed values rather than modifying the whole group. Here is an example:
def some_func(x):
    # If the group's focus is 'a', count the rows where b_time > a_time
    if x.focus.iloc[0] == 'a':
        a = x.a_time
        b = x.b_time
        changes = (b > a).sum()
        days = len(a)
    # If the group's focus is 'b', count the rows where a_time > b_time
    elif x.focus.iloc[0] == 'b':
        a = x.a_time
        b = x.b_time
        changes = (a > b).sum()
        days = len(a)
    # If the group's focus is 'both', there are no changes to count
    elif x.focus.iloc[0] == 'both':
        changes = 0
        days = len(x)
    else:
        changes = None
        days = None
    return pd.Series({'changes': changes, 'days': days})
Note that this function returns a Series object instead of modifying the input DataFrame. Also, it does not use the all method of the Series object, which can be slow when the Series has a large number of elements.
You can then apply this function to your data like this:
test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index()
This should make the some_func function run much faster when the number of unique userids is large.
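If you want to avoid the if/elif branching entirely, a possible fully vectorized sketch (assuming, as in the example data, that focus only takes the values 'a' and 'b', so the 'both' case is not handled) is to pick the comparison direction per row with np.where and then aggregate with a single groupby, similar to the first answer:
# Choose the comparison direction per row based on focus, then count per group
# (assumes focus is only 'a' or 'b')
flipped = np.where(test_dat.focus.eq('a'),
                   test_dat.b_time.gt(test_dat.a_time),
                   test_dat.a_time.gt(test_dat.b_time))
(test_dat.assign(aux = flipped)
         .groupby(['userid', 'focus'])['aux']
         .agg(changes = 'sum', days = 'size')
         .reset_index())
This keeps all the work in vectorized NumPy/pandas operations, so it should scale to >100K unique userid values.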