I'm looking for a more efficient method to deal with the following problem. I have a Dataframe with a column filled with values that randomly range from 1 to 4, I need to remove all the rows of the data frame that do not follow the sequence (1-2-3-4-1-2-3-...). This is what I have:
A B
12/2/2022 0.02 2
14/2/2022 0.01 1
15/2/2022 0.04 4
16/2/2022 -0.02 3
18/2/2022 -0.01 2
20/2/2022 0.04 1
21/2/2022 0.02 3
22/2/2022 -0.01 1
24/2/2022 0.04 4
26/2/2022 -0.02 2
27/2/2022 0.01 3
28/2/2022 0.04 1
01/3/2022 -0.02 3
03/3/2022 -0.01 2
05/3/2022 0.04 1
06/3/2022 0.02 3
08/3/2022 -0.01 1
10/3/2022 0.04 4
12/3/2022 -0.02 2
13/3/2022 0.01 3
15/3/2022 0.04 1
...
This is what I need:
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
...
Since the data frame is quite big I need some sort of NumPy-based operation to accomplish this, the more efficient the better. My solution is very ugly and inefficient, basically, I made 4 loops like the following to check for every part of the sequence (4-1,1-2,2-3,3-4):
df_len = len(df)
df_len2 = 0
while df_len != df_len2:
df_len = len(df)
df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
df = df[df['B'] != 0]
df_len2 = len(df)
CodePudding user response:
By means of itertools.cycle
(to define cycled range):
from itertools import cycle
c_rng = cycle(range(1, 5)) # cycled range
start = next(c_rng) # starting point
df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
CodePudding user response:
A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B
to construct a Boolean index, like this:
is_in_sequence = []
next_target = 1
for b in df.B:
if b == next_target:
is_in_sequence.append(True)
next_target = next_target % 4 1
else:
is_in_sequence.append(False)
print(df[is_in_sequence])
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1