Computing time difference in date columns ranked by other different columns Pandas-CodePudding

I'm interested in getting to a working prototype on this dataset.

The general problem is to compute the time it takes between when a request is created by the PORTAL to when someone else other than PORTAL touches it. Sample dataset is below:-

import pandas as pd
data = {'Code': ['001', '001', '001', '001','001','002','002','003'],
        'ENTRYDATE':['2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10','2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10'],
        'CREATEDBY': ['PORTAL', 'PORTAL', 'PORTAL', 'TINA','TINA','PORTAL','PORTAL','JAMES'],
        'STATUS': ['STOPPED', 'STARTED', 'INITIAL', 'STARTED','INITIAL','INITIAL','INITIAL','INITIAL'],
        'UPDATEDBY':['PORTAL', 'PORTAL', 'PORTAL', 'JAMES','JOE','PORTAL','RITA','URL'],
        'UPDATEDON':['2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10','2/18/2022 20:21','2/18/2022 20:32','2/27/2022 14:27','3/1/2022 15:10']}
df_test= pd.DataFrame(data)

The rules in getting the resultant dataframe are as follows:-

For each Code:

If CREATEDBY has PORTAL in it, check whether STATUS called INITIAL and UPDATEDBY called PORTAL is there. If that condition is fulfilled, pick the earliest UPDATEDON that matches. This is because a Code can have STATUS with more than one INITIAL. The resultant UPDATEDON is stored for further computation in Step 3.
For the same Code, look for the next UPDATEDBY that is NOT PORTAL by UPDATEDON.
Get the difference in days/hours and seconds between the date in Step2 and Step one as a new column (DIFFERENCE) item.

For the Code with just one STATUS -- INITIAL/ any other and UPDATEDBY has just PORTAL OR any other status can just be left out. The key is that first CREATEDBY has to be PORTAL AND STATUS --INITIAL and UPDATEDBY --PORTAL has to exist for the computation to be made.

The resultant dataframe should be as below:-

data_result = {'Code': ['001', '002'],        
        'CREATEDBY': ['PORTAL', 'PORTAL'],
        'ENTRYDATE':['2/27/2022 14:27','2/18/2022 20:32'],
        'UPDATEDBY':['JAMES', 'RITA'],        
        'UPDATEDON':['3/1/2022 15:10','2/27/2022 14:27'],
        'DIFFERENCE':['4 days','9 days']}
df_result = pd.DataFrame(data_result)

CodePudding user response：

I'm not sure to fully understand the rules, but you could try this:

# rule 1
mask = df_test.loc[
    (df_test["CREATEDBY"] == "PORTAL")
    & (df_test["STATUS"] == "INITIAL")
    & (df_test["UPDATEDBY"] == "PORTAL"),
    :,
].index

# rule 2
match = []
for i in mask:
    for j, value in enumerate(df_test["UPDATEDBY"].to_list()[i:]):
        if value != "PORTAL":
            match.append(i   j)
            break

# rule 3
df_test["UPDATEDON"] = df_test["UPDATEDON"].apply(pd.to_datetime)
for i in match:
    df_test.loc[i, "DIFFERENCE"] = (
        df_test.loc[i, "UPDATEDON"] - df_test.loc[i - 1, "UPDATEDON"]
    )
df_test[["UPDATEDBY", "UPDATEDON", "DIFFERENCE"]] = df_test[
    ["UPDATEDBY", "UPDATEDON", "DIFFERENCE"]
].shift(-1)

# cleanup
df_test = df_test.dropna().reset_index(drop=True).drop(columns="STATUS")

So that:

print(df_test)

# Output
  Code        ENTRYDATE CREATEDBY UPDATEDBY           UPDATEDON      DIFFERENCE
0  001  2/27/2022 14:27    PORTAL     JAMES 2022-03-01 15:10:00 2 days 00:43:00
1  002  2/18/2022 20:32    PORTAL      RITA 2022-02-27 14:27:00 8 days 17:55:00

CodePudding user response：

Here's another approach:

def prep(df):
    mask = (
        df[['CREATEDBY', 'STATUS', 'UPDATEDBY']]
        .eq(['PORTAL', 'INITIAL', 'PORTAL'])
        .all(axis=1)
    )
    if not mask.any():
        return None
    i = df.loc[mask].UPDATEDON.idxmin(axis=0)
    date_min = df.at[i, 'UPDATEDON']

    mask = df.UPDATEDBY.ne('PORTAL') & df.UPDATEDON.gt(date_min)
    j = df[mask].UPDATEDON.idxmin(axis=0)

    result = df.loc[[i], :]
    for col in ['UPDATEDBY', 'UPDATEDON']:
        result.at[i, col] = df.at[j, col]
    return result.assign(DIFFERENCE=df.at[j, 'UPDATEDON'] - date_min)

df_test.UPDATEDON = pd.to_datetime(df_test.UPDATEDON)
df_test.ENTRYDATE = pd.to_datetime(df_test.ENTRYDATE)

result = (
    df_test.groupby('Code', as_index=False).apply(prep)
    .drop(columns='STATUS').droplevel(1, axis=0)
)

Result:

  Code           ENTRYDATE CREATEDBY UPDATEDBY           UPDATEDON  \
0  001 2022-02-27 14:27:00    PORTAL     JAMES 2022-03-01 15:10:00   
1  002 2022-02-18 20:32:00    PORTAL      RITA 2022-02-27 14:27:00   

       DIFFERENCE  
0 2 days 00:43:00  
1 8 days 17:55:00