Home > OS >  In dataframe, how to speed up recognizing rows that have more than 5 consecutive previous values wit
In dataframe, how to speed up recognizing rows that have more than 5 consecutive previous values wit

Time:11-18

I have a dataframe like this.

       val  consecutive
0   0.0001          0.0
1   0.0008          0.0
2  -0.0001          0.0
3   0.0005          0.0
4   0.0008          0.0
5   0.0002          0.0
6   0.0012          0.0
7   0.0012          1.0
8   0.0007          1.0
9   0.0004          1.0
10  0.0002          1.0
11  0.0000          0.0
12  0.0015          0.0
13 -0.0005          0.0
14 -0.0003          0.0
15  0.0001          0.0
16  0.0001          0.0
17  0.0003          0.0
18 -0.0003          0.0
19 -0.0001          0.0
20  0.0000          0.0
21  0.0000          0.0
22 -0.0008          0.0
23 -0.0008          0.0
24 -0.0001          0.0
25 -0.0006          0.0
26 -0.0010          1.0
27  0.0002          0.0
28 -0.0003          0.0
29 -0.0008          0.0
30 -0.0010          0.0
31 -0.0003          0.0
32 -0.0005          1.0
33 -0.0012          1.0
34 -0.0002          1.0
35  0.0000          0.0
36 -0.0018          0.0
37 -0.0009          0.0
38 -0.0007          0.0
39  0.0000          0.0
40 -0.0011          0.0
41 -0.0006          0.0
42 -0.0010          0.0
43 -0.0015          0.0
44 -0.0012          1.0
45 -0.0011          1.0
46 -0.0010          1.0
47 -0.0014          1.0
48 -0.0011          1.0
49 -0.0017          1.0
50 -0.0015          1.0
51 -0.0010          1.0
52 -0.0014          1.0
53 -0.0012          1.0
54 -0.0004          1.0
55 -0.0007          1.0
56 -0.0011          1.0
57 -0.0008          1.0
58 -0.0006          1.0
59  0.0002          0.0

The column 'consecutive' is what I want to compute. It is '1' when current row has more than 5 consecutive previous values with same sign (either positive or negative, including it self).

What I've tried is:

df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)

But it's too slow for large dataset.

Do you have any idea how to speed up?

CodePudding user response:

One option is to avoid the use of apply() altogether.

The main idea is to create 2 'helper' columns:

  • sign: boolean Series indicating if value is positive (True) or negative (False)
  • id: group identical consecutive occurences together

Finally, we can groupby the id and use cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. get all rows with 5 consecutive sign values).

# Setup test dataset
import pandas as pd
import numpy as np

vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})

# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()

# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4

Benchmarking

On my system I get the following benchmarks:

sign = df['val'] >= 0

# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()

# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4

# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Thus in total we get an average runtime of: 4.51 ms

For reference, your solution and @Emma 's solution ran respectively on my system in:

# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

Not sure this is fast enough for your data size but using min, max seems faster.

With 20k rows,

df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)

# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)

# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Related