I need your recommendations to improve the following code. Its purpose is to calculate the time until the machine starts, based on the MachineState column. The current code needs about 16–18 minutes to iterate through a dataframe of about 100 000 rows × 700 columns, which is far too long for me.
import numpy as np
import pandas as pd

# Create mask with Shut Down state, get index numbers/positions
ShutDownMask = df['MachineState'] == 'Shut Down'
ShutDownPos = np.flatnonzero(ShutDownMask)

# Create mask with Starting state, get index numbers/positions
StartingMask = df['MachineState'] == 'Starting'
StartingPos = np.flatnonzero(StartingMask)

for index, row in df.iterrows():
    if row['MachineState'] == 'Shut Down':
        start = pd.to_datetime(row['Date'])
        try:
            # Timestamp of the next 'Starting' row at or after 'start'
            idx = df.iloc[StartingPos].index[df.iloc[StartingPos].index.get_loc(start, method='backfill')]
            df.loc[index, 'TimeToStart'] = idx - start
        except KeyError:
            # No 'Starting' row after this one (last set of records)
            pass
I have also tried the recommended option with np.where, but without success, as I don't know how to implement the counter to the next index:

df['TimeToStart'] = np.where(
    df['MachineState'] == 'Shut Down',
    df.iloc[StartingPos].index[df.iloc[StartingPos].index.get_loc(pd.to_datetime(df['Date']), method='backfill')],
    pd.NaT
)
The dataframe looks like this:
Date                | MachineState | TimeToStart     |
--------------------|--------------|-----------------|
10/02/2021 10:30:00 | Shut Down    | 0 days 00:30:00 |
10/02/2021 10:40:00 | Shut Down    | 0 days 00:20:00 |
10/02/2021 10:50:00 | Shut Down    | 0 days 00:10:00 |
10/02/2021 11:00:00 | Starting     |                 |
CodePudding user response:
You could try the following (df being your dataframe):
import pandas as pd

def time_to_start(sdf):
    # Each group ends either with a 'Starting' row or at the end of the data
    if sdf.MachineState.iat[-1] == "Starting":
        sdf["TimeToStart"] = sdf.Date.iat[-1] - sdf.Date
    else:
        sdf["TimeToStart"] = pd.NaT
    return sdf

df.Date = pd.to_datetime(df.Date)  # Just to make sure
df = (
    df.groupby(
        df.MachineState.shift().eq("Starting").cumsum(),
        as_index=False,
        sort=False
    )
    .apply(time_to_start)
)
Result for
Date MachineState
0 10/02/2021 10:30:00 Shut Down
1 10/02/2021 10:40:00 Shut Down
2 10/02/2021 10:50:00 Shut Down
3 10/02/2021 11:00:00 Starting
4 10/02/2021 11:10:00 Shut Down
5 10/02/2021 11:30:00 Shut Down
6 10/02/2021 12:00:00 Starting
7 10/02/2021 12:40:00 Shut Down
is
Date MachineState TimeToStart
0 2021-10-02 10:30:00 Shut Down 0 days 00:30:00
1 2021-10-02 10:40:00 Shut Down 0 days 00:20:00
2 2021-10-02 10:50:00 Shut Down 0 days 00:10:00
3 2021-10-02 11:00:00 Starting 0 days 00:00:00
4 2021-10-02 11:10:00 Shut Down 0 days 00:50:00
5 2021-10-02 11:30:00 Shut Down 0 days 00:30:00
6 2021-10-02 12:00:00 Starting 0 days 00:00:00
7 2021-10-02 12:40:00 Shut Down NaT
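
For a quick check, the sample input above can be rebuilt like this (just a reproduction sketch; pd.to_datetime parses these strings month-first by default, which is why the result shows 2021-10-02):

import pandas as pd

df = pd.DataFrame({
    "Date": ["10/02/2021 10:30:00", "10/02/2021 10:40:00",
             "10/02/2021 10:50:00", "10/02/2021 11:00:00",
             "10/02/2021 11:10:00", "10/02/2021 11:30:00",
             "10/02/2021 12:00:00", "10/02/2021 12:40:00"],
    "MachineState": ["Shut Down", "Shut Down", "Shut Down", "Starting",
                     "Shut Down", "Shut Down", "Starting", "Shut Down"],
})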
Some explanations:
One method to identify the blocks of rows that belong together is the following: first check in the MachineState column whether the previous row contains 'Starting' (-> Grouping_1), and then take the cumulative sum (False/True = 0/1; -> Grouping_2):

df["Grouping_1"] = df.MachineState.shift().eq("Starting")
df["Grouping_2"] = df.Grouping_1.cumsum()

                  Date MachineState  Grouping_1  Grouping_2
0  10/02/2021 10:30:00    Shut Down       False           0
1  10/02/2021 10:40:00    Shut Down       False           0
2  10/02/2021 10:50:00    Shut Down       False           0
3  10/02/2021 11:00:00     Starting       False           0
4  10/02/2021 11:10:00    Shut Down        True           1
5  10/02/2021 11:30:00    Shut Down       False           1
6  10/02/2021 12:00:00     Starting       False           1
7  10/02/2021 12:40:00    Shut Down        True           2
Now use groupby over this grouping to process the connected blocks of rows with apply.

The time_to_start function used by apply does the following: it first checks whether the sub-dataframe actually ends with a 'Starting' row (by construction of the grouping, a 'Starting' row can only be the last row of a block). If that's the case, it takes the corresponding time and subtracts the 'Shut Down' times from it to build the TimeToStart column. If that's not the case, it just indicates that by setting the TimeToStart column to NaT.
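
Since the original concern is run time on roughly 100 000 rows, one further option is a fully vectorized sketch of the same idea without groupby/apply. This is not part of the answer above; it assumes df.Date is already datetime, sorted ascending, and that only the two states shown matter. The idea: keep Date only on 'Starting' rows, backfill it so every row sees the timestamp of the next 'Starting' row, and subtract.

# Vectorized sketch (assumes df.Date is datetime and sorted ascending)
next_start = df.Date.where(df.MachineState.eq("Starting")).bfill()
df["TimeToStart"] = next_start - df.Date  # NaT after the last 'Starting' row

On the sample data above this produces the same TimeToStart values, including NaT for the final 'Shut Down' row.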