I have a dataset that i need to get the highest two values per each "cycle". Using the sample code below, I was wondering how I could code this to be possible:
import pandas as pd
data = {'Pressure' : [100,112,114,120,123,420,123,1230,132,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333],
'Machine On/Off' : [0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0]}
df = pd.DataFrame(data)
Using this data as an example, when the 'Machine On/Off" is a 1 it means that the machine is turned on, when it becomes a 1 again it means its off. What type of code would be possible to track each of the two highest values in every cycle?
In the example above, the machine begins its first cycle at 120 pressure, then turns off at 132 pressure, the two highest values it should read are 420 and 1230. The next cycle then begins at 13 pressure and ends at 3 pressure, so this cycles two highest values are 123 and 123.
Does anyone know how to write a code that will give me the two highest values for each cycle for every cycle in this list?
I tried using this code, but it only gives me the two highest in the whole column, how can i make it give me the two highest for each cycle?
highest = df['Pressure'].nlargest(2)
CodePudding user response:
We can set group numbers for each period of machine on/off toggle.
Then, select the odd-numbered groups corresponding to machine turned on.
Finally, get the largest 2 within the turned on groups.
# Set group numbers
df['group'] = df['Machine On/Off'].cumsum()
# Select only groups with machine turned on
df2 = df.loc[df['group'] % 2 == 1]
# Get largest 2 Pressure within selected groups
df2.groupby('group')['Pressure'].nlargest(2)
Result:
group
1 7 1230
5 420
3 14 123
16 123
5 27 401
26 40
Name: Pressure, dtype: int64
In the output above, the left is the group number. The middle is the original row index. The right is the 2 largest Pressure
is descending order.
If you want to make the output a dataframe for better readability, you can further use:
df2.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index']).reset_index()
Result:
group row_index Pressure
0 1 7 1230
1 1 5 420
2 3 14 123
3 3 16 123
4 5 27 401
5 5 26 40
Update
If you want to put the 2 largest numbers back into the original dataframe, you can use:
df_top2 = (df2.groupby('group')['Pressure']
.agg(lambda x: x.nlargest(2).tolist())
.reset_index(name='Top_2_Pressures')
)
df.merge(df_top2, on='group', how='left')
Result:
Pressure Machine On/Off group Top_2_Pressures
0 100 0 0 NaN
1 112 0 0 NaN
2 114 0 0 NaN
3 120 1 1 [1230, 420]
4 123 0 1 [1230, 420]
5 420 0 1 [1230, 420]
6 123 0 1 [1230, 420]
7 1230 0 1 [1230, 420]
8 132 1 2 NaN
9 1 0 2 NaN
10 23 0 2 NaN
11 13 0 2 NaN
12 13 0 2 NaN
13 13 1 3 [123, 123]
14 123 0 3 [123, 123]
15 13 0 3 [123, 123]
16 123 0 3 [123, 123]
17 3 1 4 NaN
18 222 0 4 NaN
19 2303 0 4 NaN
20 1233 0 4 NaN
21 1233 0 4 NaN
22 1 0 4 NaN
23 1 0 4 NaN
24 30 0 4 NaN
25 20 1 5 [401, 40]
26 40 0 5 [401, 40]
27 401 0 5 [401, 40]
28 10 0 5 [401, 40]
29 40 0 5 [401, 40]
30 12 0 5 [401, 40]
31 122 1 6 NaN
32 1 0 6 NaN
33 12 0 6 NaN
34 333 0 6 NaN
Breakdown of Steps
After setting group numbers, we get:
print(df)
Pressure Machine On/Off group
0 100 0 0
1 112 0 0
2 114 0 0
3 120 1 1
4 123 0 1
5 420 0 1
6 123 0 1
7 1230 0 1
8 132 1 2
9 1 0 2
10 23 0 2
11 13 0 2
12 13 0 2
13 13 1 3
14 123 0 3
15 13 0 3
16 123 0 3
17 3 1 4
18 222 0 4
19 2303 0 4
20 1233 0 4
21 1233 0 4
22 1 0 4
23 1 0 4
24 30 0 4
25 20 1 5
26 40 0 5
27 401 0 5
28 10 0 5
29 40 0 5
30 12 0 5
31 122 1 6
32 1 0 6
33 12 0 6
34 333 0 6
Then, after filtering the groups with machine turned on:
print(df2)
Pressure Machine On/Off group
3 120 1 1
4 123 0 1
5 420 0 1
6 123 0 1
7 1230 0 1
13 13 1 3
14 123 0 3
15 13 0 3
16 123 0 3
25 20 1 5
26 40 0 5
27 401 0 5
28 10 0 5
29 40 0 5
30 12 0 5
CodePudding user response:
My solution to this is to keep track of when the machine is on or off, and iterate over the rows of the table. When the machine is on, put the pressure values into a list
in a dict
, where we make a new dict key whenever the machine is switched on. Dict keys are just int
s counting up from a value of your choice.
I use defaultdict
here as generally I find it's nicer to use when generating dictionaries on the fly like this.
import pandas as pd
from collections import defaultdict
data = {'Pressure' : [100,112,114,120,123,420,123,1230,132,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333],
'Machine On/Off' : [0, 0, 0, 1, 0, 0, 0, 0, 1, 0,0, 0, 0, 1, 0, 0, 0, 1,0, 0, 0, 0, 0,0,0, 1, 0, 0, 0, 0, 0, 1, 0, 0,0]}
# 1111111111111111111111--------------222222222222222---------------------------33333333333333333333---------
df = pd.DataFrame(data)
# filter for just when the machine is switched on. We'll use a dict to keep track of each set of data
pressure_on = defaultdict(list)
machine_on = False # assuming the machine is off to start
loop_counter = 0
for index, row in df.iterrows():
# print(row[0], row[1])
if row['Machine On/Off'] == 1:
if machine_on: # don't want to increment twice!
loop_counter = 1
machine_on = not machine_on
if machine_on:
pressure_on[loop_counter].append(row['Pressure'])
# find the highest 2 values from each set:
pressure_on_max = defaultdict(list)
for key, value in pressure_on.items():
value.sort(reverse=True)
pressure_on_max[key] = value[0:2]
print(pressure_on_max)
Which gives a defaultdict that looks like
defaultdict(<class 'list'>, {0: [1230, 420], 1: [123, 123], 2: [401, 40]})
which you can do whatever you want with, just treat it like a dictionary.
I believe this will not count the value measured at the same time as the machine is switched off, but it will count when it's turned on. I think you can resolve that if you wish by swapping the "if machine_on" section to be before the "if row['Machine On/Off'] == 1" section.