Home > Blockchain >  How to remove rows so that the values in a column match a sequence
How to remove rows so that the values in a column match a sequence

Time:01-16

I'm looking for a more efficient method to deal with the following problem. I have a Dataframe with a column filled with values that randomly range from 1 to 4, I need to remove all the rows of the data frame that do not follow the sequence (1-2-3-4-1-2-3-...). This is what I have:

              A     B
12/2/2022    0.02   2
14/2/2022    0.01   1
15/2/2022    0.04   4
16/2/2022   -0.02   3
18/2/2022   -0.01   2
20/2/2022    0.04   1
21/2/2022    0.02   3
22/2/2022   -0.01   1
24/2/2022    0.04   4
26/2/2022   -0.02   2
27/2/2022    0.01   3
28/2/2022    0.04   1
01/3/2022   -0.02   3
03/3/2022   -0.01   2
05/3/2022    0.04   1
06/3/2022    0.02   3
08/3/2022   -0.01   1
10/3/2022    0.04   4
12/3/2022   -0.02   2
13/3/2022    0.01   3
15/3/2022    0.04   1
      ...

This is what I need:

              A     B
14/2/2022    0.01   1
18/2/2022   -0.01   2
21/2/2022    0.02   3
24/2/2022    0.04   4
28/2/2022    0.04   1
03/3/2022   -0.01   2
06/3/2022    0.02   3
10/3/2022    0.04   4
15/3/2022    0.04   1
        ...

Since the data frame is quite big I need some sort of NumPy-based operation to accomplish this, the more efficient the better. My solution is very ugly and inefficient, basically, I made 4 loops like the following to check for every part of the sequence (4-1,1-2,2-3,3-4):

df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)

CodePudding user response:

By means of itertools.cycle (to define cycled range):

from itertools import cycle

c_rng = cycle(range(1, 5))  # cycled range
start = next(c_rng)  # starting point
df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]

              A  B
                 
14/2/2022  0.01  1
18/2/2022 -0.01  2
21/2/2022  0.02  3
24/2/2022  0.04  4
28/2/2022  0.04  1
03/3/2022 -0.01  2
06/3/2022  0.02  3
10/3/2022  0.04  4
15/3/2022  0.04  1

CodePudding user response:

A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B to construct a Boolean index, like this:

is_in_sequence = []
next_target = 1

for b in df.B:
    if b == next_target:
        is_in_sequence.append(True)
        next_target = next_target % 4   1
    else:
        is_in_sequence.append(False)
        
print(df[is_in_sequence])
             A     B
14/2/2022    0.01   1
18/2/2022   -0.01   2
21/2/2022    0.02   3
24/2/2022    0.04   4
28/2/2022    0.04   1
03/3/2022   -0.01   2
06/3/2022    0.02   3
10/3/2022    0.04   4
15/3/2022    0.04   1
  • Related