I have two dataframes one is failures dataframe and the other is maintenance dataframe.
I want to find the number of days difference between the records of both these dataframes based on some conditions.
Condition is-
- There are multiple machineIDs and component numbers
comp1, comp2, comp3, comp4
. So, the operation needs to be performed on each machineID and component number such that for every machineID and its component number I have the day difference between failure and maintenance date. - First remove duplicate dates from maintenance dataframe.
- Maintenance date should be older than failure date and there can be multiple dates from maintenance dataframe which are older than failures date.
- So, averaging out after finding their difference and appending it in the failures dataframe as a new column.
The failures dataframe is-
datetime machineID failure
0 2021-02-04 1 comp3
1 2021-03-21 1 comp1
2 2021-04-05 1 comp4
3 2021-05-05 1 comp3
4 2021-05-20 1 comp2
5 2021-06-04 1 comp4
6 2021-06-19 1 comp2
7 2021-08-03 1 comp3
8 2021-08-03 1 comp4
9 2021-11-01 1 comp4
And this is the maintenance dataframe
datetime machineID comp
0 2020-07-01 1 comp4
1 2020-09-14 1 comp1
2 2020-09-14 1 comp2
3 2020-11-13 1 comp3
4 2021-01-05 1 comp1
5 2021-01-20 1 comp1
6 2021-02-04 1 comp3
7 2021-02-19 1 comp3
8 2021-03-06 1 comp3
9 2021-03-21 1 comp1
I am taking an example from failures dataframe
for index 1
with machineID 1
, datetime 2021-03-21
and failure of Component 1
. Now, if see for the same machineID and component1 in maintenance dataframe, there are 3 dates which are older than the one in failures dataframe i.e., 2021-03-21
. Now, I want to calculate day difference of 2021-03-21 from all these 3 dates of maintenance dataframe and average it out and append it the same index of failure dataframe.
failure datetime- 2021-03-21
maintenance datetime- 2020-09-14; 2021-01-05; 2021-01-20
Corresponding day difference- 188 days, 75 days; 60 days.
Average of these 3- (188 75 60)/3= 108 days.
This is something I want-
datetime machineID failure day_dif
1 2021-03-21 1 comp1 108 days
Similarly for component 2, 3 and 4
Edit1- As per @oskros answer, I am getting this error-
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1028 result, _ = self.grouper.aggregate(
-> 1029 block.values, how, axis=1, min_count=min_count
1030 )
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in aggregate(self, values, how, axis, min_count)
580 return self._cython_operation(
--> 581 "aggregate", values, how, axis, min_count=min_count
582 )
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
510
--> 511 func, values = self._get_cython_func_and_vals(kind, how, values, is_numeric)
512
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_cython_func_and_vals(self, kind, how, values, is_numeric)
410 try:
--> 411 func = self._get_cython_function(kind, how, values, is_numeric)
412 except NotImplementedError:
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_cython_function(self, kind, how, values, is_numeric)
385 raise NotImplementedError(
--> 386 f"function is not implemented for this dtype: "
387 f"[how->{how},dtype->{dtype_str}]"
NotImplementedError: function is not implemented for this dtype: [how->max,dtype->object]
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1060 # unwrap DataFrame to get array
-> 1061 assert len(result._data.blocks) == 1
1062 result = result._data.blocks[0].values
AssertionError:
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last)
<ipython-input-119-02d194427401> in <module>
1 # 1) Remove duplicates in maintenance dataframe, by getting the latest maintenance date
2 # for each combination of machineID and comp
----> 3 maint_df = maint.groupby(['machineID', 'comp']).max().reset_index()
4
5 # 2) Merge the two dataframes so we have maintenance and failure dates for each
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in f(self, **kwargs)
1376 # try a cython aggregation if we can
1377 try:
-> 1378 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1379 except DataError:
1380 pass
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1002 ) -> DataFrame:
1003 agg_blocks, agg_items = self._cython_agg_blocks(
-> 1004 how, alt=alt, numeric_only=numeric_only, min_count=min_count
1005 )
1006 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
/anaconda/envs/sklearnEnv/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1065
1066 finally:
-> 1067 assert not isinstance(result, DataFrame)
1068
1069 if result is not no_result:
AssertionError:
The datatypes of maintenance dataframe
is
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1038 entries, 0 to 1046
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Asset 1038 non-null object
1 Type 1038 non-null object
2 datetime 1038 non-null datetime64[ns]
3 machineID 1038 non-null int64
4 comp 1038 non-null object
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 48.7 KB
CodePudding user response:
If we defined your maintenance dataframe as maint_df
and your failure dataframe as fail_df
, the following code should do the trick
# 1) Remove duplicates in maintenance dataframe, by getting the latest maintenance date
# for each combination of machineID and comp
maint_df = maint_df.groupby(['machineID', 'comp']).max().reset_index()
# 2) Merge the two dataframes so we have maintenance and failure dates for each
# combination
fail_df = fail_df.rename(columns={'failure': 'comp'}) # Rename column allowing the merge
merged = maint_df.merge(fail_df,
on=['machineID', 'comp'],
suffixes=('_maintenance', '_failure'))
# 3) Filter out rows where maintenance date isn't older than failure date
merged = merged[merged['datetime_maintenance'] < merged['datetime_failure']]
# 4) Calculate the day difference between the two dates.
# NB: This assumes your date columns are strings - if they are already date objects,
# you can remove the strptime
merged['diff_days'] = merged.apply(
lambda x: (dt.datetime.strptime(x.datetime_failure, '%Y-%m-%d') -
dt.datetime.strptime(x.datetime_maintenance, '%Y-%m-%d')).days,
axis=1)
# 5) Average the day difference across machineID and comp
print(merged.groupby(['machineID', 'comp']).mean())