I have some test data which looks like this:
test_data = {'equipment_barcode': ['0001', '0001', '0001', '0001', '0002', '0002', '0002', '0002'],
'date': ['2022-04-16', '2022-04-17', '2022-04-18', '2022-04-19', '2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16'],
'downtime_minutes':[1200, 240, 1440, 600, 360, 1440, 1440, 180]}
I have converted this into a pandas dataframe:
df = pd.DataFrame(test_data)
print(df)
equipment_barcode date downtime_minutes
0 0001 2022-04-16 1200
1 0001 2022-04-17 240
2 0001 2022-04-18 1440
3 0001 2022-04-19 600
4 0002 2022-05-13 360
5 0002 2022-05-14 1440
6 0002 2022-05-15 1440
7 0002 2022-05-16 180
There are 1,440 minutes in one day. This information is required to calculate the NUMBER of faults. A fault may occur over consecutive number of days and thus will still be counted as ONE fault only.
The task is to find the number of faults for each equipment_barcode. For example equipment 0001 had 2 faults [one fault on 16th of April lasting for 1200 minutes and one fault on 17th-19th of April lasting for 2280 minutes(240 1440 600)].
For added clarity; equipment 0002 had 1 fault [fault on 13th-16th of May lasting for 3420 minutes (360 1440 1440 180)].
In the end I would like a resulting dataframe that shows something like this:
equipment_barcode fault_number total_downtime_minutes
0 0001 1 1200
1 0001 2 2280
2 0002 1 3420
CodePudding user response:
Can you try the following:
test_data = {
'equipment_barcode': ['0001', '0001', '0001', '0001', '0002', '0002', '0002', '0002'],
'date': ['2022-04-15', '2022-04-17', '2022-04-18', '2022-04-19', '2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16'],
'downtime_minutes':[1200, 240, 1440, 600, 360, 1440, 1440, 180]
}
df = pd.DataFrame(test_data)
df['date'] = pd.to_datetime(df['date'])
dt = df['date']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()
df['groups'] = groups
dft = df.groupby(by=['equipment_barcode', 'groups'])['downtime_minutes'].sum().reset_index()
dft = dft.drop('groups', axis=1)
dft['fault_number'] = dft.groupby(by='equipment_barcode').cumcount() 1
print(dft)
Output:
equipment_barcode downtime_minutes fault_number
0 0001 1200 1
1 0001 2280 2
2 0002 3420 1
CodePudding user response:
try:
data = {'equipment_barcode': ['0001', '0001', '0001', '0001',
'0002', '0002', '0002', '0002'],
'date': ['2022-04-15', '2022-04-17', '2022-04-18', '2022-04-19',
'2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16'],
'downtime_minutes':[1200, 240, 1440, 600, 360, 1440, 1440, 180]
}
df = pd.DataFrame(data)
df
'''
equipment_barcode date downtime_minutes
0 0001 2022-04-15 1200
1 0001 2022-04-17 240
2 0001 2022-04-18 1440
3 0001 2022-04-19 600
4 0002 2022-05-13 360
5 0002 2022-05-14 1440
6 0002 2022-05-15 1440
7 0002 2022-05-16 180
'''
df.astype({'date': 'datetime64[ns]'}).dtypes
'''
equipment_barcode object
date datetime64[ns]
downtime_minutes int64
dtype: object
'''
then:
def func(grp: pd.DataFrame):
return (
grp.assign(fault_number=lambda x: x.date.diff() != pd.Timedelta('1d'))
.assign(fault_number=lambda x: x.fault_number.cumsum())
.groupby('fault_number')
.agg(downtime_minutes=('downtime_minutes', sum))
)
# code
(
df.astype({'date': 'datetime64[ns]'})
.groupby(['equipment_barcode'])
.apply(func)
.reset_index()
)
'''
equipment_barcode fault_number downtime_minutes
0 0001 1 1200
1 0001 2 2280
2 0002 1 3420
'''