I need your recommendations to improve the following code. Its purpose is to calculate the time until the machine starts, based on the MachineState column. The current code needs about 16–18 minutes to iterate through a dataframe of about 100 000 rows × 700 columns, which is far too long for me.
import numpy as np
import pandas as pd

# Create mask with Shut Down state, get index numbers/positions
ShutDownMask = df['MachineState'] == 'Shut Down'
ShutDownPos = np.flatnonzero(ShutDownMask)

# Create mask with Starting state, get index numbers/positions
StartingMask = df['MachineState'] == 'Starting'
StartingPos = np.flatnonzero(StartingMask)

for index, row in df.iterrows():
    if row['MachineState'] == 'Shut Down':
        start = pd.to_datetime(row['Date'])
        try:
            # Timestamp of the next 'Starting' row at or after 'start'
            idx = df.iloc[StartingPos].index[df.iloc[StartingPos].index.get_loc(start, method='backfill')]
            df.loc[index, 'TimeToStart'] = idx - start
        except KeyError:
            # No 'Starting' row after this one (last set of records)
            pass
I have also tried the recommended option with np.where, but without success, as I don't know how to implement the counter to the next index:

df['TimeToStart'] = np.where(
    df['MachineState'] == 'Shut Down',
    df.iloc[StartingPos].index[df.iloc[StartingPos].index.get_loc(pd.to_datetime(df['Date']), method='backfill')],
    pd.NaT
)
The dataframe looks like this:
Date                | MachineState | TimeToStart     |
--------------------|--------------|-----------------|
10/02/2021 10:30:00 | Shut Down    | 0 days 00:30:00 |
10/02/2021 10:40:00 | Shut Down    | 0 days 00:20:00 |
10/02/2021 10:50:00 | Shut Down    | 0 days 00:10:00 |
10/02/2021 11:00:00 | Starting     |                 |
CodePudding user response:
You could try the following (df being your dataframe):
import pandas as pd

def time_to_start(sdf):
    # Each group ends either with a 'Starting' row or at the end of the data
    if sdf.MachineState.iat[-1] == "Starting":
        sdf["TimeToStart"] = sdf.Date.iat[-1] - sdf.Date
    else:
        sdf["TimeToStart"] = pd.NaT
    return sdf

df.Date = pd.to_datetime(df.Date)  # Just to make sure
df = (
    df.groupby(
        df.MachineState.shift().eq("Starting").cumsum(),
        as_index=False,
        sort=False
    )
    .apply(time_to_start)
)
Result for
Date MachineState
0 10/02/2021 10:30:00 Shut Down
1 10/02/2021 10:40:00 Shut Down
2 10/02/2021 10:50:00 Shut Down
3 10/02/2021 11:00:00 Starting
4 10/02/2021 11:10:00 Shut Down
5 10/02/2021 11:30:00 Shut Down
6 10/02/2021 12:00:00 Starting
7 10/02/2021 12:40:00 Shut Down
is
Date MachineState TimeToStart
0 2021-10-02 10:30:00 Shut Down 0 days 00:30:00
1 2021-10-02 10:40:00 Shut Down 0 days 00:20:00
2 2021-10-02 10:50:00 Shut Down 0 days 00:10:00
3 2021-10-02 11:00:00 Starting 0 days 00:00:00
4 2021-10-02 11:10:00 Shut Down 0 days 00:50:00
5 2021-10-02 11:30:00 Shut Down 0 days 00:30:00
6 2021-10-02 12:00:00 Starting 0 days 00:00:00
7 2021-10-02 12:40:00 Shut Down NaT
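
For a quick check, the sample input above can be rebuilt like this (just a reproduction sketch; pd.to_datetime parses these strings month-first by default, which is why the result shows 2021-10-02):

import pandas as pd

df = pd.DataFrame({
    "Date": ["10/02/2021 10:30:00", "10/02/2021 10:40:00",
             "10/02/2021 10:50:00", "10/02/2021 11:00:00",
             "10/02/2021 11:10:00", "10/02/2021 11:30:00",
             "10/02/2021 12:00:00", "10/02/2021 12:40:00"],
    "MachineState": ["Shut Down", "Shut Down", "Shut Down", "Starting",
                     "Shut Down", "Shut Down", "Starting", "Shut Down"],
})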
Some explanations:
One method to identify the blocks of rows that belong together is the following: first check in the MachineState column whether the previous row contains 'Starting' (-> Grouping_1), and then take the cumulative sum (False/True = 0/1; -> Grouping_2):

df["Grouping_1"] = df.MachineState.shift().eq("Starting")
df["Grouping_2"] = df.Grouping_1.cumsum()

                  Date MachineState  Grouping_1  Grouping_2
0  10/02/2021 10:30:00    Shut Down       False           0
1  10/02/2021 10:40:00    Shut Down       False           0
2  10/02/2021 10:50:00    Shut Down       False           0
3  10/02/2021 11:00:00     Starting       False           0
4  10/02/2021 11:10:00    Shut Down        True           1
5  10/02/2021 11:30:00    Shut Down       False           1
6  10/02/2021 12:00:00     Starting       False           1
7  10/02/2021 12:40:00    Shut Down        True           2
Now use groupby over this grouping to process the connected blocks of rows with apply.

The time_to_start function used by apply does the following: it first checks whether the sub-dataframe actually ends with a 'Starting' row (by construction of the grouping, a 'Starting' row can only be the last row of a block). If that's the case, it takes the corresponding time and subtracts the 'Shut Down' times from it to build the TimeToStart column. If that's not the case, it just indicates that by setting the TimeToStart column to NaT.
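
Since the original concern is run time on roughly 100 000 rows, one further option is a fully vectorized sketch of the same idea without groupby/apply. This is not part of the answer above; it assumes df.Date is already datetime, sorted ascending, and that only the two states shown matter. The idea: keep Date only on 'Starting' rows, backfill it so every row sees the timestamp of the next 'Starting' row, and subtract.

# Vectorized sketch (assumes df.Date is datetime and sorted ascending)
next_start = df.Date.where(df.MachineState.eq("Starting")).bfill()
df["TimeToStart"] = next_start - df.Date  # NaT after the last 'Starting' row

On the sample data above this produces the same TimeToStart values, including NaT for the final 'Shut Down' row.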