I want to select only the rows that contain three or more sequential (consecutive) values.
num1 | num2 | num3 | num4 | num5 | num6 |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 |
3 | 7 | 8 | 9 | 10 | 11 |
9 | 13 | 21 | 22 | 23 | 24 |
5 | 8 | 11 | 17 | 21 | 24 |
[Result]
num1 | num2 | num3 | num4 | num5 | num6 |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 |
3 | 7 | 8 | 9 | 10 | 11 |
9 | 13 | 21 | 22 | 23 | 24 |
Is there an easy way to do this using pandas or numpy?
CodePudding user response:
IIUC, compute a diff along the columns, mark the differences equal to 1 (i.e. consecutive values) with eq, and use rolling.sum to detect a run of N-1 unit differences, which corresponds to N consecutive values. Use the produced Series to perform boolean indexing:
N = 3
# a run of N consecutive values shows up as N-1 consecutive unit differences
df[df.diff(axis=1).eq(1).T.rolling(window=N - 1).sum().ge(N - 1).any()]
output:
   num1  num2  num3  num4  num5  num6
0     1     2     3     4     5     6
1     3     7     8     9    10    11
2     9    13    21    22    23    24
intermediate count of consecutive unit differences:
>>> df.diff(axis=1).eq(1).T.rolling(window=N - 1).sum()
        0    1    2    3
num1  NaN  NaN  NaN  NaN
num2  1.0  0.0  0.0  0.0
num3  2.0  1.0  0.0  0.0
num4  2.0  2.0  1.0  0.0
num5  2.0  2.0  2.0  0.0
num6  2.0  2.0  2.0  0.0
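Since the question also mentions numpy, the same run-of-unit-differences check can be sketched without rolling, using sliding_window_view (available in NumPy >= 1.20). This is a minimal sketch that rebuilds the example frame; the variable names are mine:

```python
import numpy as np
import pandas as pd

# the example data from the question
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],
     [3, 7, 8, 9, 10, 11],
     [9, 13, 21, 22, 23, 24],
     [5, 8, 11, 17, 21, 24]],
    columns=[f"num{i}" for i in range(1, 7)],
)

N = 3  # minimum run length, in values
d = np.diff(df.to_numpy(), axis=1) == 1  # True where two neighbors are consecutive
# a run of N consecutive values corresponds to N-1 consecutive unit differences
windows = np.lib.stride_tricks.sliding_window_view(d, N - 1, axis=1)
mask = windows.all(axis=2).any(axis=1)   # any window of N-1 all-True differences
print(df[mask])
```

This keeps rows 0, 1 and 2, matching the desired result.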
Previous incorrect answer (an approach common to the other answers in this Q/A):
This only counted the number of unit differences per row, not whether they are consecutive. For instance, 1-2-4-5-7-8 has three unit differences and would be selected, although no three of its values are consecutive.
N = 3
df1 = df[df.diff(axis=1).eq(1).sum(axis=1).ge(N)]
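The failure mode is easy to reproduce. The following self-contained sketch uses the 1-2-4-5-7-8 row mentioned above: the sum-based filter keeps it even though its longest run is only two consecutive values:

```python
import pandas as pd

# a row with three unit differences but no run of three consecutive values
df = pd.DataFrame([[1, 2, 4, 5, 7, 8]], columns=[f"num{i}" for i in range(1, 7)])

N = 3
# diffs are NaN, 1, 2, 1, 2, 1 -> three Trues -> sum is 3 -> row is (wrongly) kept
kept = df[df.diff(axis=1).eq(1).sum(axis=1).ge(N)]
print(kept)
```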
CodePudding user response:
Get differences per row with DataFrame.diff, compare them to 1 with DataFrame.eq, count the Trues with sum, and test whether the count is at least 3 with Series.ge:
df1 = df[df.diff(axis=1).eq(1).sum(axis=1).ge(3)]
(Note that this counts all unit differences in a row, whether or not they are consecutive; the EDIT below handles runs.)
EDIT: You can count runs of consecutive 1s in the differences with a cumulative sum, then take the maximal run length per row and compare it to the threshold:
print (df)
num1 num2 num3 num4 num5 num6
0 1 2 4 5 8 9
1 3 7 8 9 10 11
2 9 13 21 22 23 24
3 5 8 11 17 21 24
df1 = df.diff(axis=1)
m = df1.eq(1)
b = m.cumsum(axis=1)
# 3 consecutive values leave a run of 2 unit differences
mask = b.sub(b.mask(m).ffill(axis=1).fillna(0)).max(axis=1).ge(2)
df2 = df[mask]
print (df2)
num1 num2 num3 num4 num5 num6
1 3 7 8 9 10 11
2 9 13 21 22 23 24
Details:
print (b.sub(b.mask(m).ffill(axis=1).fillna(0)))
num1 num2 num3 num4 num5 num6
0 0.0 1.0 0.0 1.0 0.0 1.0
1 0.0 0.0 1.0 2.0 3.0 4.0
2 0.0 0.0 0.0 1.0 2.0 3.0
3 0.0 0.0 0.0 0.0 0.0 0.0
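The cumsum trick above generalizes to run-length counting on any boolean sequence. A minimal sketch on a single row of unit-difference flags (variable names are mine; the flags match row 1 of the example):

```python
import pandas as pd

# True where a value continues a consecutive run (a unit difference)
m = pd.Series([False, False, True, True, True, True])

b = m.cumsum()  # running total of Trues: 0, 0, 1, 2, 3, 4
# freeze the cumsum at each False and forward-fill; subtracting it
# from the running total yields the length of the current run
runs = b - b.mask(m).ffill().fillna(0)
print(runs.tolist())  # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
```

Taking `runs.max()` per row and comparing it to the threshold gives the mask used above.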
Performance comparison (best to test on your real data); in general, the rolling approach is the bottleneck here:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [82]: %%timeit
...: df1 = df.diff(axis=1)
...: m = df1.eq(1)
...: b = m.cumsum(axis=1)
...: mask = b.sub(b.mask(m).ffill(axis=1).fillna(0)).max(axis=1).ge(3)
...: df[mask]
...:
...:
35.6 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [83]: %%timeit
...: df[df.diff(axis=1).eq(1).T.rolling(window=N).sum().ge(N).any()]
...:
2.79 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
Try this:
df = df[df.diff(axis=1).eq(1).sum(axis=1).ge(3)]