Home > other >  Using later of two dates and then subtracting another date within groupby
Using later of two dates and then subtracting another date within groupby

Time:10-26

I have a dataframe that looks something like this:

df = pd.DataFrame([1,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM','1/2/22 12:00:00 AM'],
[1,'A','X','1/4/22 12:00:00AM','1/3/22 12:00:00 AM','1/3/22 12:00:00 AM'],
[1,'A','Y','1/3/22 12:00:00AM','1/2/22 12:00:00 AM','1/1/22 12:00:00 AM'],
[1,'B','X','1/3/22 12:00:00AM','1/2/22 12:00:00 AM','1/3/22 12:00:00 AM'],
[2,'A','X','1/5/22 12:00:00AM','1/3/22 12:00:00 AM','1/4/22 12:00:00 AM'],
[2,'A','X','1/6/22 12:00:00AM','1/4/22 12:00:00 AM','1/5/22 12:00:00 AM']],
columns = ['ID','Category','Site','Task Completed','Access Completed', 'Upload Completed'])
ID Category Site Task Completed Access Completed Upload Completed
1 A X 1/3/22 12:00:00AM 1/1/22 12:00:00 AM 1/2/22 12:00:00 AM
1 A X 1/4/22 12:00:00AM 1/3/22 12:00:00 AM 1/3/22 12:00:00 AM
1 A Y 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/1/22 12:00:00 AM
1 B X 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/3/22 12:00:00 AM
2 A X 1/5/22 12:00:00AM 1/3/22 12:00:00 AM 1/4/22 12:00:00 AM
2 A X 1/6/22 12:00:00AM 1/4/22 12:00:00 AM 1/5/22 12:00:00 AM

What I want to find is the time difference (in hours) between the later of max access completed date or min upload completed date and the first Task Completed date for every ID/Category/Site combination within the dataset.

In plain text, the logic is (the later date of (max access completed date vs min upload completed date)) - (min/first task completed date).

Based on this dataset, my intended result would look like:

ID Category Site Time Difference First Task Completed Max Access Date Min Upload Date
1 A X 0 1/3/22 12:00:00AM 1/3/22 12:00:00 AM 1/2/22 12:00:00 AM
1 A Y 24 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/1/22 12:00:00 AM
1 B X 0 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/3/22 12:00:00 AM
2 A X 24 1/5/22 12:00:00AM 1/4/22 12:00:00 AM 1/4/22 12:00:00 AM

I know how to get the max access date minus the first task completed date:

out = (df
   .groupby(['ID', 'Category', 'Site'], as_index=False)
   .agg({'Task Completed': 'first', 'Access Completed': 'max'})
   .assign(**{'Time Difference': lambda d: d['Task Completed']
              .sub(d['Access Completed'])
              .dt.total_seconds().floordiv(3600)})
)

Which has output:

   ID Category Site      Task Completed Access Completed  Time Difference
0   1        A    X 2022-01-03 00:00:00       2022-01-02             24.0
1   1        A    Y 2022-01-01 01:00:00       2022-01-01              1.0
2   1        B    X 2022-01-01 01:00:00       2022-01-01              1.0
3   2        A    X 2022-01-03 00:00:00       2022-01-02             24.0

But I don't know how to incorporate min upload completed date.

CodePudding user response:

Try:

df["Task Completed"] = pd.to_datetime(
    df["Task Completed"], format="%m/%d/%y %H:%M:%S%p"
)
df["Access Completed"] = pd.to_datetime(
    df["Access Completed"], format="%m/%d/%y %H:%M:%S %p"
)
df["Upload Completed"] = pd.to_datetime(
    df["Upload Completed"], format="%m/%d/%y %H:%M:%S %p"
)

out = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    {
        "Task Completed": "first",
        "Access Completed": "max",
        "Upload Completed": "min",
    }
)

out["Time Difference"] = np.where(
    (out["Access Completed"] - out["Upload Completed"]) > pd.Timedelta(0),
    (out["Access Completed"] - out["Task Completed"]).abs().dt.total_seconds()
    / 3600,
    (out["Upload Completed"] - out["Task Completed"]).abs().dt.total_seconds()
    / 3600,
)

print(out)

Prints:

ID Category Site Task Completed Access Completed Upload Completed Time Difference
0 1 A X 2022-01-03 12:00:00 2022-01-03 12:00:00 2022-01-02 12:00:00 0
1 1 A Y 2022-01-03 12:00:00 2022-01-02 12:00:00 2022-01-01 12:00:00 24
2 1 B X 2022-01-03 12:00:00 2022-01-02 12:00:00 2022-01-03 12:00:00 0
3 2 A X 2022-01-05 12:00:00 2022-01-04 12:00:00 2022-01-04 12:00:00 24

CodePudding user response:

@Andrej Kesely's answer is already covered most part. Yet, if you would like to do this without numpy, you can modify the numpy operation part as

max_time = out[["Access Completed", "Upload Completed"]].max(axis=1)
out["Time Difference"] = (out["Task Completed"] - max_time).dt.total_seconds() / 3600
  • Related