I have a data frame that looks like this:
Tag
0 skip_1
1 run
2 skip_1
3 run
4 skip_1
5 run
6 skip_2
7 run
8 skip_1
9 run
10 skip_2
11 jump
12 skip_1
13 run
14 skip_2
15 jump
16 skip_1
17 run
18 skip_2
19 cleanup_jump
20 skip_1
21 run
22 skip_2
23 run
24 skip_2
25 jump
26 skip_1
27 run
28 skip_2
29 jump
First, I would like to count the RUN occurrences between two JUMP events, then to enumerate this occurrences from the latest to the earliest within this range. The expected results would be:
Tag Jump_Run_Count Run_Order
0 skip_1 0 0
1 run 0 5
2 skip_1 0 0
3 run 0 4
4 skip_1 0 0
5 run 0 3
6 skip_2 0 0
7 run 0 2
8 skip_1 0 0
9 run 0 1
10 skip_2 0 0
11 jump 5 0
12 skip_1 0 0
13 run 0 1
14 skip_2 0 0
15 jump 1 0
16 skip_1 0 0
17 run 0 0
18 skip_2 0 0
19 cleanup_jump 0 0
20 skip_1 0 0
21 run 0 2
22 skip_2 0 0
23 run 0 1
24 skip_2 0 0
25 jump 2 0
26 skip_1 0 0
27 run 0 1
28 skip_2 0 0
29 jump 1 0
One of the problems here is that the first RUN occurrences are not within 2 JUMP but are between the first JUMP and the beginning of the column.
Secondly I would like to do the same count and enumerate for a CLEANUP_JUMP and JUMP range, and store it in separate columns.
Tag Jump_Run_Count Run_Order Cleanup_Jump_Dig_Count Run_Order2
0 skip_1 0 0 0 0
1 run 0 5 0 0
2 skip_1 0 0 0 0
3 run 0 4 0 0
4 skip_1 0 0 0 0
5 run 0 3 0 0
6 skip_2 0 0 0 0
7 run 0 2 0 0
8 skip_1 0 0 0 0
9 run 0 1 0 0
10 skip_2 0 0 0 0
11 jump 5 0 0 0
12 skip_1 0 0 0 0
13 run 0 1 0 0
14 skip_2 0 0 0 0
15 jump 1 0 0 0
16 skip_1 0 0 0 0
17 run 0 0 0 1
18 skip_2 0 0 0 0
19 cleanup_jump 0 0 1 0
20 skip_1 0 0 0 0
21 run 0 2 0 0
22 skip_2 0 0 0 0
23 run 0 1 0 0
24 skip_2 0 0 0 0
25 jump 2 0 0 0
26 skip_1 0 0 0 0
27 run 0 1 0 0
28 skip_2 0 0 0 0
29 jump 1 0 0 0
I have added some pictures that might explain it better:
Any help on how to code this, or even another way to approach this issue will be highly appreciated.
Thanks!
CodePudding user response:
Here is a solution using pandas:
import pandas as pd
import numpy as np
df['run'] = df['Tag'] == 'run'
val_mask = df['Tag'].replace({'cleanup_jump':'jump'}) == 'jump'
df['tag_id'] = (val_mask).cumsum()
df.loc[val_mask, 'Jump_Count'] = df.groupby('tag_id')['run'].sum().to_numpy()[:-1]
df.loc[df['run'], 'run_per_jump'] = df.loc[df['run']].groupby('tag_id')['run'].cumsum()
df['Jump_Run_Order'] = df.groupby('tag_id')['run_per_jump'].rank(method='dense', ascending=False)
jumps_idx = np.flatnonzero(df['Tag'] == 'jump')
cj_idxs = np.flatnonzero(df['Tag'] == 'cleanup_jump')
cj_help_idxs = np.asarray([np.max(jumps_idx[jumps_idx < cj_idx]) for cj_idx in cj_idxs])
for start, end in zip(cj_help_idxs 1, cj_idxs):
df.loc[start:end, 'Cleanup_Jump_Count'] = df.loc[start:end, 'Jump_Count']
df.loc[start:end, 'Cleanup_Jump_Run_Order'] = df.loc[start:end, 'Jump_Run_Order']
df.loc[start:end, 'Jump_Run_Order'] = 0
df.loc[start:end, 'Jump_Count'] = 0
df = df.drop(columns=['tag_id', 'run', 'run_per_jump']).fillna(0).convert_dtypes(convert_integer=True)
print(df)
Tag Jump_Count Jump_Run_Order Cleanup_Jump_Run_Order Cleanup_Jump_Count
0 skip_1 0 0 0 0
1 run 0 5 0 0
2 skip_1 0 0 0 0
3 run 0 4 0 0
4 skip_1 0 0 0 0
5 run 0 3 0 0
6 skip_2 0 0 0 0
7 run 0 2 0 0
8 skip_1 0 0 0 0
9 run 0 1 0 0
10 skip_2 0 0 0 0
11 jump 5 0 0 0
12 skip_1 0 0 0 0
13 run 0 1 0 0
14 skip_2 0 0 0 0
15 jump 1 0 0 0
16 skip_1 0 0 0 0
17 run 0 0 1 0
18 skip_2 0 0 0 0
19 cleanup_jump 0 0 0 1
20 skip_1 0 0 0 0
21 run 0 2 0 0
22 skip_2 0 0 0 0
23 run 0 1 0 0
24 skip_2 0 0 0 0
25 jump 2 0 0 0
26 skip_1 0 0 0 0
27 run 0 1 0 0
28 skip_2 0 0 0 0
29 jump 1 0 0 0