Calculate Time Difference based on Conditionals-CodePudding

I have a dataframe that looks something like this (actual dataframe is millions of rows):

ID	Category	Site	Task Completed	Access Completed
1	A	X	1/2/22 12:00:00AM	1/1/22 12:00:00 AM
1	A	Y	1/3/22 12:00:00AM	1/2/22 12:00:00 AM
1	A	X	1/1/22 1:00:00AM	1/1/22 12:00:00 AM
1	B	X	1/1/22 1:00:00AM	1/1/22 12:00:00 AM
2	A	X	1/2/22 12:00:00AM	1/1/22 12:00:00 AM
2	A	X	1/1/22 1:00:00AM	1/1/22 12:00:00 AM

A few things to point out about this dataframe. First, the 'access completed' date is constant for a ID/category/site combination. So, for ID 1, Cat A, Site X, the access completed will always be 1/1/22 12:00:00 AM no matter how many 'task completed' there are for that combo. Second, there can be multiple task completed dates for a ID/Category/Site combination.

What I want to find is the time difference (in hours) between Access Completed and first Task Completed for every ID/Category/Site combination within the dataset. The intended result would look something like this:

ID	Category	Site	Time Difference
1	A	X	1
1	A	Y	24
1	B	X	1
2	A	X	1

Thanks so much for your help.

CodePudding user response：

Try as follows:

First, use pd.to_datetime to turn date/time strings into datetime.
Sort the df on column "Task Completed", apply df.groupby and get .first.
Now, subtract "Access Completed" from "Task Completed" using Series.sub, and use .dt.total_seconds on the Timedelta values to get all seconds. Since we want hours, we apply .div(3600).
Lastly, reset the index and use df.rename to assign a meaningful column name.

df[['Task Completed','Access Completed']] = \
    df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))

res = df.sort_values('Task Completed').groupby(['ID','Category','Site']).first()
res = res['Task Completed'].sub(res['Access Completed'])\
    .dt.total_seconds().div(3600).reset_index(drop=False).rename(
        columns={0:'Time Difference'})

print(res)

   ID Category Site  Time Difference
0   1        A    X              1.0
1   1        A    Y             24.0
2   1        B    X              1.0
3   2        A    X              1.0

CodePudding user response：

df['Task Completed'] = pd.to_datetime(df['Task Completed'])
df['Access Completed'] = pd.to_datetime(df['Access Completed'])
df


df['time difference']=df.groupby(['ID','Category','Site'])['Task Completed'].transform('min').sub(df['Access Completed'])
df['diff (hrs)']=pd.to_timedelta(df['time difference']).dt.total_seconds()/3600

df['diff (hrs)']=pd.to_timedelta(df['time difference']).dt.total_seconds()/3600
df[['ID', 'Category', 'Site',       'time difference', 'diff (hrs)']]

added difference both in days and in hours

    ID  Category    Site    time difference     diff (hrs)
0    1     A         X       0 days 01:00:00    1.0
1    1     A         Y       1 days 00:00:00    24.0
2    1     A         X       0 days 01:00:00    1.0
3    1     B         X       0 days 01:00:00    1.0
4    2     A         X       0 days 01:00:00    1.0
5    2     A         X       0 days 01:00:00    1.0