I have a dataframe like this:
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It should be 1.0 when the current row and the 4 rows before it (a window of 5 values, including the row itself) all have the same sign, either all positive or all negative, and 0.0 otherwise.
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But it's too slow for a large dataset.
Do you have any idea how to speed it up?
CodePudding user response:
One option is to avoid the use of apply() altogether.
The main idea is to create 2 'helper' columns:
- sign: boolean Series indicating whether the value is positive (True) or negative (False)
- id: groups identical consecutive occurrences together
Finally, we can groupby the id and use the cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. get all rows with 5 consecutive same-sign values).
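As a quick toy illustration of the id trick (a sketch on a made-up 5-value series, not part of the benchmark below): a new id starts every time the sign flips, so consecutive same-sign values share an id.
import pandas as pd

s = pd.Series([0.1, 0.2, -0.3, -0.1, 0.5])
sign = s >= 0
# ne() marks the rows where the sign differs from the previous row;
# cumsum() turns those change points into group ids: [1, 1, 2, 2, 3]
print(sign.ne(sign.shift()).cumsum().tolist())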
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
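# Note: '>= 0' counts exact zeros as positive here, whereas the question's
# original check (np.all(arr > 0) or np.all(arr < 0)) treats zeros as neither sign.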
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
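If you want the flag as 0.0/1.0 floats like in the question's sample output (and the helper column removed afterwards), a small follow-up using the columns created above could be:
# Match the question's output format: booleans -> 0.0/1.0
df['consecutive'] = df['consecutive'].astype(float)
# The 'id' helper column is no longer needed
df = df.drop(columns='id')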
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus, in total, we get an average runtime of 4.51 ms.
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
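(The per-step timings above look like %timeit output; to reproduce them roughly outside IPython, a quick sketch with the standard timeit module, assuming the df and 'id' column from the setup above, would be:)
import timeit
# Average time per call for the groupby/cumcount step, in seconds
print(timeit.timeit(lambda: df.groupby('id').cumcount() >= 4, number=100) / 100)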
CodePudding user response:
Not sure whether this is fast enough for your data size, but using min / max inside the lambda seems faster.
With 20k rows:
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
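A fully vectorized variant of the same min/max idea (a sketch, not benchmarked here) drops apply() entirely in favour of the built-in rolling aggregations, which avoids calling a Python lambda for every window:
roll = df['val'].rolling(5)
# A 5-value window is all one sign iff its minimum is > 0 or its maximum is < 0;
# NaN comparisons in the first 4 rows evaluate to False, so those rows become 0.0
df['consecutive'] = ((roll.min() > 0) | (roll.max() < 0)).astype(float)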