Python Pandas: From a series of Cycles using 1 to toggle On/Off and get the 2 largest numbers only f-CodePudding

I have a dataset that i need to get the highest two values per each "cycle". Using the sample code below, I was wondering how I could code this to be possible:

import pandas as pd

data = {'Pressure' : [100,112,114,120,123,420,123,1230,132,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333],
       'Machine On/Off' : [0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0]}

df = pd.DataFrame(data)

Using this data as an example, when the 'Machine On/Off" is a 1 it means that the machine is turned on, when it becomes a 1 again it means its off. What type of code would be possible to track each of the two highest values in every cycle?

In the example above, the machine begins its first cycle at 120 pressure, then turns off at 132 pressure, the two highest values it should read are 420 and 1230. The next cycle then begins at 13 pressure and ends at 3 pressure, so this cycles two highest values are 123 and 123.

Does anyone know how to write a code that will give me the two highest values for each cycle for every cycle in this list?

I tried using this code, but it only gives me the two highest in the whole column, how can i make it give me the two highest for each cycle?

highest = df['Pressure'].nlargest(2)

CodePudding user response：

We can set group numbers for each period of machine on/off toggle.

Then, select the odd-numbered groups corresponding to machine turned on.

Finally, get the largest 2 within the turned on groups.

# Set group numbers
df['group'] = df['Machine On/Off'].cumsum()

# Select only groups with machine turned on
df2 = df.loc[df['group'] % 2 == 1]

# Get largest 2 Pressure within selected groups
df2.groupby('group')['Pressure'].nlargest(2)

Result:

group    
1      7     1230
       5      420
3      14     123
       16     123
5      27     401
       26      40
Name: Pressure, dtype: int64

In the output above, the left is the group number. The middle is the original row index. The right is the 2 largest Pressure is descending order.

If you want to make the output a dataframe for better readability, you can further use:

df2.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index']).reset_index()

Result:

   group  row_index  Pressure
0      1          7      1230
1      1          5       420
2      3         14       123
3      3         16       123
4      5         27       401
5      5         26        40

Update

If you want to put the 2 largest numbers back into the original dataframe, you can use:

df_top2 = (df2.groupby('group')['Pressure']
              .agg(lambda x: x.nlargest(2).tolist())
              .reset_index(name='Top_2_Pressures')
          )

df.merge(df_top2, on='group', how='left')

Result:

    Pressure  Machine On/Off  group Top_2_Pressures
0        100               0      0             NaN
1        112               0      0             NaN
2        114               0      0             NaN
3        120               1      1     [1230, 420]
4        123               0      1     [1230, 420]
5        420               0      1     [1230, 420]
6        123               0      1     [1230, 420]
7       1230               0      1     [1230, 420]
8        132               1      2             NaN
9          1               0      2             NaN
10        23               0      2             NaN
11        13               0      2             NaN
12        13               0      2             NaN
13        13               1      3      [123, 123]
14       123               0      3      [123, 123]
15        13               0      3      [123, 123]
16       123               0      3      [123, 123]
17         3               1      4             NaN
18       222               0      4             NaN
19      2303               0      4             NaN
20      1233               0      4             NaN
21      1233               0      4             NaN
22         1               0      4             NaN
23         1               0      4             NaN
24        30               0      4             NaN
25        20               1      5       [401, 40]
26        40               0      5       [401, 40]
27       401               0      5       [401, 40]
28        10               0      5       [401, 40]
29        40               0      5       [401, 40]
30        12               0      5       [401, 40]
31       122               1      6             NaN
32         1               0      6             NaN
33        12               0      6             NaN
34       333               0      6             NaN

Breakdown of Steps

After setting group numbers, we get:

print(df)

    Pressure  Machine On/Off  group
0        100               0      0
1        112               0      0
2        114               0      0
3        120               1      1
4        123               0      1
5        420               0      1
6        123               0      1
7       1230               0      1
8        132               1      2
9          1               0      2
10        23               0      2
11        13               0      2
12        13               0      2
13        13               1      3
14       123               0      3
15        13               0      3
16       123               0      3
17         3               1      4
18       222               0      4
19      2303               0      4
20      1233               0      4
21      1233               0      4
22         1               0      4
23         1               0      4
24        30               0      4
25        20               1      5
26        40               0      5
27       401               0      5
28        10               0      5
29        40               0      5
30        12               0      5
31       122               1      6
32         1               0      6
33        12               0      6
34       333               0      6

Then, after filtering the groups with machine turned on:

print(df2)

    Pressure  Machine On/Off  group
3        120               1      1
4        123               0      1
5        420               0      1
6        123               0      1
7       1230               0      1
13        13               1      3
14       123               0      3
15        13               0      3
16       123               0      3
25        20               1      5
26        40               0      5
27       401               0      5
28        10               0      5
29        40               0      5
30        12               0      5

CodePudding user response：

My solution to this is to keep track of when the machine is on or off, and iterate over the rows of the table. When the machine is on, put the pressure values into a list in a dict, where we make a new dict key whenever the machine is switched on. Dict keys are just ints counting up from a value of your choice.

I use defaultdict here as generally I find it's nicer to use when generating dictionaries on the fly like this.

import pandas as pd
from collections import defaultdict

data = {'Pressure' :      [100,112,114,120,123,420,123,1230,132,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333],
       'Machine On/Off' : [0,  0,  0,  1,  0,  0,  0,  0,   1,  0,0, 0, 0, 1, 0,  0, 0,  1,0,  0,   0,   0,   0,0,0, 1, 0, 0,  0, 0, 0, 1,  0, 0,0]}
       #                               1111111111111111111111--------------222222222222222---------------------------33333333333333333333---------

df = pd.DataFrame(data)


# filter for just when the machine is switched on. We'll use a dict to keep track of each set of data
pressure_on = defaultdict(list)
machine_on = False  # assuming the machine is off to start
loop_counter = 0
for index, row in df.iterrows():
    # print(row[0], row[1])
    if row['Machine On/Off'] == 1:
        if machine_on:  # don't want to increment twice!
            loop_counter  = 1
        machine_on = not machine_on
    if machine_on:
        pressure_on[loop_counter].append(row['Pressure'])

# find the highest 2 values from each set:
pressure_on_max = defaultdict(list)   
for key, value in pressure_on.items():
    value.sort(reverse=True)
    pressure_on_max[key] = value[0:2]
print(pressure_on_max)

Which gives a defaultdict that looks like

defaultdict(<class 'list'>, {0: [1230, 420], 1: [123, 123], 2: [401, 40]})

which you can do whatever you want with, just treat it like a dictionary.

I believe this will not count the value measured at the same time as the machine is switched off, but it will count when it's turned on. I think you can resolve that if you wish by swapping the "if machine_on" section to be before the "if row['Machine On/Off'] == 1" section.