Pandas - How to find number of equipment failures and the corresponding downtime for each failure-CodePudding

I have some test data which looks like this:

test_data = {'equipment_barcode': ['0001', '0001', '0001', '0001', '0002', '0002', '0002', '0002'], 
         'date': ['2022-04-16', '2022-04-17', '2022-04-18', '2022-04-19', '2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16'],
         'downtime_minutes':[1200, 240, 1440, 600, 360, 1440, 1440, 180]}

I have converted this into a pandas dataframe:

df = pd.DataFrame(test_data)

print(df)

    equipment_barcode   date          downtime_minutes
0   0001                2022-04-16    1200
1   0001                2022-04-17    240
2   0001                2022-04-18    1440
3   0001                2022-04-19    600
4   0002                2022-05-13    360
5   0002                2022-05-14    1440
6   0002                2022-05-15    1440
7   0002                2022-05-16    180

There are 1,440 minutes in one day. This information is required to calculate the NUMBER of faults. A fault may occur over consecutive number of days and thus will still be counted as ONE fault only.

The task is to find the number of faults for each equipment_barcode. For example equipment 0001 had 2 faults [one fault on 16th of April lasting for 1200 minutes and one fault on 17th-19th of April lasting for 2280 minutes(240 1440 600)].

For added clarity; equipment 0002 had 1 fault [fault on 13th-16th of May lasting for 3420 minutes (360 1440 1440 180)].

In the end I would like a resulting dataframe that shows something like this:

    equipment_barcode   fault_number    total_downtime_minutes
0   0001                1               1200
1   0001                2               2280
2   0002                1               3420

CodePudding user response：

Can you try the following:

test_data = {
    'equipment_barcode': ['0001', '0001', '0001', '0001', '0002', '0002', '0002', '0002'],
    'date': ['2022-04-15', '2022-04-17', '2022-04-18', '2022-04-19', '2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16'],
    'downtime_minutes':[1200, 240, 1440, 600, 360, 1440, 1440, 180]
}
df = pd.DataFrame(test_data)
df['date'] = pd.to_datetime(df['date'])
dt = df['date']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()
df['groups'] = groups
dft = df.groupby(by=['equipment_barcode', 'groups'])['downtime_minutes'].sum().reset_index()
dft = dft.drop('groups', axis=1)
dft['fault_number'] = dft.groupby(by='equipment_barcode').cumcount()   1
print(dft)

Output:

  equipment_barcode  downtime_minutes  fault_number
0              0001              1200             1
1              0001              2280             2
2              0002              3420             1

CodePudding user response：

try:

data = {'equipment_barcode': ['0001', '0001', '0001', '0001',
                              '0002', '0002', '0002', '0002'], 
        'date': ['2022-04-15', '2022-04-17', '2022-04-18', '2022-04-19',
                 '2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16'],
        'downtime_minutes':[1200, 240, 1440, 600, 360, 1440, 1440, 180]
       }

df = pd.DataFrame(data)
df
'''
  equipment_barcode        date  downtime_minutes
0              0001  2022-04-15              1200
1              0001  2022-04-17               240
2              0001  2022-04-18              1440
3              0001  2022-04-19               600
4              0002  2022-05-13               360
5              0002  2022-05-14              1440
6              0002  2022-05-15              1440
7              0002  2022-05-16               180
'''

df.astype({'date': 'datetime64[ns]'}).dtypes
'''
equipment_barcode            object
date                 datetime64[ns]
downtime_minutes              int64
dtype: object
'''

then:

def func(grp: pd.DataFrame):
    return (
        grp.assign(fault_number=lambda x: x.date.diff() != pd.Timedelta('1d'))
        .assign(fault_number=lambda x: x.fault_number.cumsum())
        .groupby('fault_number')
        .agg(downtime_minutes=('downtime_minutes', sum))
    )

# code
(
    df.astype({'date': 'datetime64[ns]'})
    .groupby(['equipment_barcode'])
    .apply(func)
    .reset_index()
)
'''
  equipment_barcode  fault_number  downtime_minutes
0              0001             1              1200
1              0001             2              2280
2              0002             1              3420
'''