I have a Dataframe file that looks like below (imagine so many other rows):
DateTime hours uptime Start_stop machine_started Voltage
2022-02-22 07:50:58;269 18,7 43269 221 0 800
2022-02-22 07:50:59;614 18,7 44609 221 0 800
2022-02-22 07:50:59;614 18,7 44609 221 1 800
2022-02-22 07:51:01;915 18,7 46904 221 1 700
2022-02-22 07:51:01;915 18,7 46904 221 1 500
2022-02-22 08:16:20;343 18,7 48523 221 0 800
2022-02-22 08:16:20;638 18,7 48523 221 0 800
2022-02-23 07:51:01;915 18,7 46904 221 1 800
2022-02-23 07:51:03;543 18,7 48523 221 1 1000
2022-02-23 08:16:20;343 18,7 48523 221 0 800
2022-02-23 08:16:20;638 18,7 48523 221 0 800
I need a logic to say when the machine is working, from the moment the (machine_started==1) start taking the rows until it ends (machine_started==0) and calculate the average of those rows, standard deviation and the length of that cycle.
I would like to have the Datetime tag that shows the process start time. As you can see for this example I have two times that the process starts.
So in the final file I will see a number of rows which corresponds to the number of cycles in that file.
Unfortunately I can just think of doing one task at a time and also after finding the first cycle I get stuck and cannot count the rows or perform these calculation.
So at the I would like to see something like this:
DateTime Average_Voltage standard_deviation length_of_the_cycle
2022-02-22 07:50:59;614 666 xxx YYY
2022-02-23 07:51:01;915 900 xxx YYY
How can this be acheived?
CodePudding user response:
First identify the groups that you are interested in and remove all other rows. To get the groups, we can use shift
and cumsum
. For the filtering, we need to consider whether the first row starts with machine_started
0 or 1. This determines if we want the odd or even numbered groups.
Code for this part:
df['grp'] = df['machine_started'].ne(df['machine_started'].shift(1)).cumsum()
start = df['machine_started'].iloc[0]
df = df.loc[(df['grp'] start) % 2 == 0]
Intermediate result:
DateTime hours uptime Start_stop machine_started Voltage grp
2 2022-02-22 07:50:59;614 18,7 44609 221 1 800 2
3 2022-02-22 07:51:01;915 18,7 46904 221 1 700 2
4 2022-02-22 07:51:01;915 18,7 46904 221 1 500 2
7 2022-02-23 07:51:01;915 18,7 46904 221 1 800 4
8 2022-02-23 07:51:03;543 18,7 48523 221 1 1000 4
Now, we can groupby
and aggregate the information needed. Finally, rename the obtained columns:
df = df.groupby('grp').agg({'DateTime': ['first'], 'Voltage': ['mean', 'std', 'count']}).reset_index(drop=True)
df.columns = ['DateTime', 'Average_Voltage', 'standard_deviation', 'length_of_the_cycle']
Result:
DateTime Average_Voltage standard_deviation length_of_the_cycle
0 2022-02-22 07:50:59;614 666.666667 152.752523 3
1 2022-02-23 07:51:01;915 900.000000 141.421356 2