How can we extract rows with sequential values in dataframe?


I want to select only the rows that contain three or more sequential values.

num1  num2  num3  num4  num5  num6
   1     2     3     4     5     6
   3     7     8     9    10    11
   9    13    21    22    23    24
   5     8    11    17    21    24

[Result]

num1  num2  num3  num4  num5  num6
   1     2     3     4     5     6
   3     7     8     9    10    11
   9    13    21    22    23    24

Is there an easy way using pandas or numpy?

CodePudding user response:

IIUC, compute a diff along the columns, mark the differences equal to 1 (i.e. consecutive values) with eq, and use rolling.sum on the transpose to check whether N of those marks occur consecutively within a row. Use the resulting boolean Series to perform boolean indexing. Note that a run of k sequential values yields k-1 unit differences, so window=N actually requires N+1 sequential values (use window=N-1 to also catch runs of exactly three); every matching row in the sample contains at least four, so N=3 works here.

N = 3
df[df.diff(axis=1).eq(1).T.rolling(window=N).sum().ge(N).any()]

output:

   num1  num2  num3  num4  num5  num6
0     1     2     3     4     5     6
1     3     7     8     9    10    11
2     9    13    21    22    23    24
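For reference, a self-contained sketch reproducing the above end to end (the DataFrame is rebuilt from the sample in the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [3, 7, 8, 9, 10, 11],
                   [9, 13, 21, 22, 23, 24],
                   [5, 8, 11, 17, 21, 24]],
                  columns=['num1', 'num2', 'num3', 'num4', 'num5', 'num6'])

N = 3
# diff along each row, mark unit steps, then look for N consecutive marks;
# any() reduces per original row label after the transpose
mask = df.diff(axis=1).eq(1).T.rolling(window=N).sum().ge(N).any()
print(df[mask])  # rows 0, 1 and 2
```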

intermediate count of consecutive values:

>>> df.diff(axis=1).eq(1).T.rolling(window=3).sum()

        0    1    2    3
num1  NaN  NaN  NaN  NaN
num2  NaN  NaN  NaN  NaN
num3  2.0  1.0  0.0  0.0
num4  3.0  2.0  1.0  0.0
num5  3.0  3.0  2.0  0.0
num6  3.0  3.0  3.0  0.0

Previous incorrect answer (common to all answers in this Q/A):

This only counted the number of unit differences, not whether they are consecutive. For instance, 1-2-4-5-7-8 has three unit differences and would be selected, even though the values never form a run of three consecutive numbers.

N = 3
df1 = df[df.diff(axis=1).eq(1).sum(axis=1).ge(N)]
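The failure mode can be demonstrated with the 1-2-4-5-7-8 row mentioned above (a sketch; this row is hypothetical and not part of the sample data):

```python
import pandas as pd

# hypothetical row: three unit steps, but no run of three consecutive numbers
bad = pd.DataFrame([[1, 2, 4, 5, 7, 8]],
                   columns=['num1', 'num2', 'num3', 'num4', 'num5', 'num6'])

# the sum-based check counts three unit differences, so the row wrongly passes
print(bool(bad.diff(axis=1).eq(1).sum(axis=1).ge(3)[0]))               # True

# the rolling check requires three *consecutive* unit differences: rejected
print(bool(bad.diff(axis=1).eq(1).T.rolling(3).sum().ge(3).any()[0]))  # False
```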

CodePudding user response:

Get differences per row with DataFrame.diff, compare to 1 with DataFrame.eq, count the True values with sum, and test whether the count is at least 3 with Series.ge:

df1 = df[df.diff(axis=1).eq(1).sum(axis=1).ge(3)]

EDIT: You can count consecutive 1s among the differences with a cumulative sum, take the maximal run length per row, and compare it to the threshold:

print (df)
   num1  num2  num3  num4  num5  num6
0     1     2     4     5     8     9
1     3     7     8     9    10    11
2     9    13    21    22    23    24
3     5     8    11    17    21    24

df1 = df.diff(axis=1)
m = df1.eq(1)                          # True where neighbouring values differ by 1
b = m.cumsum(axis=1)                   # running count of True values per row
# subtracting the count frozen at the last False yields the current run length
mask = b.sub(b.mask(m).ffill(axis=1).fillna(0)).max(axis=1).ge(3)
df2 = df[mask]
print (df2)
   num1  num2  num3  num4  num5  num6
1     3     7     8     9    10    11
2     9    13    21    22    23    24

Details:

print (b.sub(b.mask(m).ffill(axis=1).fillna(0)))
   num1  num2  num3  num4  num5  num6
0   0.0   1.0   0.0   1.0   0.0   1.0
1   0.0   0.0   1.0   2.0   3.0   4.0
2   0.0   0.0   0.0   1.0   2.0   3.0
3   0.0   0.0   0.0   0.0   0.0   0.0
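The cumsum trick generalizes to any boolean sequence; a minimal sketch of how the running run length is obtained (variable names chosen for illustration):

```python
import pandas as pd

m = pd.Series([False, True, True, False, True, True, True])

b = m.cumsum()                         # running count of True: 0 1 2 2 3 4 5
frozen = b.mask(m).ffill().fillna(0)   # count frozen at last False: 0 0 0 2 2 2 2
run = b - frozen                       # current run length: 0 1 2 0 1 2 3
print(run.max())                       # 3 -> longest run of consecutive True
```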

Performance (best to test on your real data); in general, rolling is the bottleneck here:

# 40k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [82]: %%timeit
    ...: df1 = df.diff(axis=1)
    ...: m = df1.eq(1)
    ...: b = m.cumsum(axis=1)
    ...: mask = b.sub(b.mask(m).ffill(axis=1).fillna(0)).max(axis=1).ge(3)
    ...: df[mask]
    ...: 
    ...: 
35.6 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [83]: %%timeit
    ...: df[df.diff(axis=1).eq(1).T.rolling(window=N).sum().ge(N).any()]
    ...: 
2.79 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
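Since the question also mentions numpy, a fully vectorized alternative can be sketched with a sliding window over the row-wise differences (an assumption, not from the answers above: `np.lib.stride_tricks.sliding_window_view` requires numpy >= 1.20; window N matches the N-consecutive-unit-differences criterion used throughout this page):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [3, 7, 8, 9, 10, 11],
                   [9, 13, 21, 22, 23, 24],
                   [5, 8, 11, 17, 21, 24]],
                  columns=['num1', 'num2', 'num3', 'num4', 'num5', 'num6'])

N = 3
d = np.diff(df.to_numpy(), axis=1) == 1          # unit steps between neighbours
windows = np.lib.stride_tricks.sliding_window_view(d, N, axis=1)
mask = windows.all(axis=-1).any(axis=1)          # any run of N consecutive unit steps
print(df[mask])  # rows 0, 1 and 2
```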

CodePudding user response:

Try this:

df = df[df.diff(axis=1).eq(1).sum(axis=1).ge(3)]