I'm interested in getting to a working prototype on this dataset.
The general problem is to compute the time it takes between when a request is created by the PORTAL to when someone else other than PORTAL touches it. Sample dataset is below:-
import pandas as pd
data = {'Code': ['001', '001', '001', '001','001','002','002','003'],
'ENTRYDATE':['2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10','2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10'],
'CREATEDBY': ['PORTAL', 'PORTAL', 'PORTAL', 'TINA','TINA','PORTAL','PORTAL','JAMES'],
'STATUS': ['STOPPED', 'STARTED', 'INITIAL', 'STARTED','INITIAL','INITIAL','INITIAL','INITIAL'],
'UPDATEDBY':['PORTAL', 'PORTAL', 'PORTAL', 'JAMES','JOE','PORTAL','RITA','URL'],
'UPDATEDON':['2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10','2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10']}
df_test= pd.DataFrame(data)
The rules in getting the resultant dataframe are as follows:-
For each Code:
If CREATEDBY has PORTAL in it, check whether STATUS called INITIAL and UPDATEDBY called PORTAL is there. If that condition is fulfilled, pick the earliest UPDATEDON that matches. This is because a Code can have STATUS with more than one INITIAL. The resultant UPDATEDON is stored for further computation in Step 3.
For the same Code, look for the next UPDATEDBY that is NOT PORTAL by UPDATEDON.
Get the difference in days/hours and seconds between the date in Step2 and Step one as a new column (DIFFERENCE) item.
For the Code with just one STATUS -- INITIAL/ any other and UPDATEDBY has just PORTAL OR any other status can just be left out. The key is that first CREATEDBY has to be PORTAL AND STATUS --INITIAL and UPDATEDBY --PORTAL has to exist for the computation to be made.
The resultant dataframe should be as below:-
data_result = {'Code': ['001', '002'],
'CREATEDBY': ['PORTAL', 'PORTAL'],
'ENTRYDATE':['2/27/2022 14:27','2/18/2022 20:32'],
'UPDATEDBY':['JAMES', 'RITA'],
'UPDATEDON':['3/1/2022 15:10','2/27/2022 14:27'],
'DIFFERENCE':['4 days','9 days']}
df_result = pd.DataFrame(data_result)
CodePudding user response:
I'm not sure to fully understand the rules, but you could try this:
# rule 1
mask = df_test.loc[
(df_test["CREATEDBY"] == "PORTAL")
& (df_test["STATUS"] == "INITIAL")
& (df_test["UPDATEDBY"] == "PORTAL"),
:,
].index
# rule 2
match = []
for i in mask:
for j, value in enumerate(df_test["UPDATEDBY"].to_list()[i:]):
if value != "PORTAL":
match.append(i j)
break
# rule 3
df_test["UPDATEDON"] = df_test["UPDATEDON"].apply(pd.to_datetime)
for i in match:
df_test.loc[i, "DIFFERENCE"] = (
df_test.loc[i, "UPDATEDON"] - df_test.loc[i - 1, "UPDATEDON"]
)
df_test[["UPDATEDBY", "UPDATEDON", "DIFFERENCE"]] = df_test[
["UPDATEDBY", "UPDATEDON", "DIFFERENCE"]
].shift(-1)
# cleanup
df_test = df_test.dropna().reset_index(drop=True).drop(columns="STATUS")
So that:
print(df_test)
# Output
Code ENTRYDATE CREATEDBY UPDATEDBY UPDATEDON DIFFERENCE
0 001 2/27/2022 14:27 PORTAL JAMES 2022-03-01 15:10:00 2 days 00:43:00
1 002 2/18/2022 20:32 PORTAL RITA 2022-02-27 14:27:00 8 days 17:55:00
CodePudding user response:
Here's another approach:
def prep(df):
mask = (
df[['CREATEDBY', 'STATUS', 'UPDATEDBY']]
.eq(['PORTAL', 'INITIAL', 'PORTAL'])
.all(axis=1)
)
if not mask.any():
return None
i = df.loc[mask].UPDATEDON.idxmin(axis=0)
date_min = df.at[i, 'UPDATEDON']
mask = df.UPDATEDBY.ne('PORTAL') & df.UPDATEDON.gt(date_min)
j = df[mask].UPDATEDON.idxmin(axis=0)
result = df.loc[[i], :]
for col in ['UPDATEDBY', 'UPDATEDON']:
result.at[i, col] = df.at[j, col]
return result.assign(DIFFERENCE=df.at[j, 'UPDATEDON'] - date_min)
df_test.UPDATEDON = pd.to_datetime(df_test.UPDATEDON)
df_test.ENTRYDATE = pd.to_datetime(df_test.ENTRYDATE)
result = (
df_test.groupby('Code', as_index=False).apply(prep)
.drop(columns='STATUS').droplevel(1, axis=0)
)
Result:
Code ENTRYDATE CREATEDBY UPDATEDBY UPDATEDON \
0 001 2022-02-27 14:27:00 PORTAL JAMES 2022-03-01 15:10:00
1 002 2022-02-18 20:32:00 PORTAL RITA 2022-02-27 14:27:00
DIFFERENCE
0 2 days 00:43:00
1 8 days 17:55:00