Home > Software engineering >  Calculate Time Difference based on Conditionals
Calculate Time Difference based on Conditionals

Time:10-14

I have a dataframe that looks something like this (actual dataframe is millions of rows):

ID Category Site Task Completed Access Completed
1 A X 1/2/22 12:00:00AM 1/1/22 12:00:00 AM
1 A Y 1/3/22 12:00:00AM 1/2/22 12:00:00 AM
1 A X 1/1/22 1:00:00AM 1/1/22 12:00:00 AM
1 B X 1/1/22 1:00:00AM 1/1/22 12:00:00 AM
2 A X 1/2/22 12:00:00AM 1/1/22 12:00:00 AM
2 A X 1/1/22 1:00:00AM 1/1/22 12:00:00 AM

A few things to point out about this dataframe. First, the 'access completed' date is constant for a ID/category/site combination. So, for ID 1, Cat A, Site X, the access completed will always be 1/1/22 12:00:00 AM no matter how many 'task completed' there are for that combo. Second, there can be multiple task completed dates for a ID/Category/Site combination.

What I want to find is the time difference (in hours) between Access Completed and first Task Completed for every ID/Category/Site combination within the dataset. The intended result would look something like this:

ID Category Site Time Difference
1 A X 1
1 A Y 24
1 B X 1
2 A X 1

Thanks so much for your help.

CodePudding user response:

Try as follows:

  • First, use pd.to_datetime to turn date/time strings into datetime.
  • Sort the df on column "Task Completed", apply df.groupby and get .first.
  • Now, subtract "Access Completed" from "Task Completed" using Series.sub, and use .dt.total_seconds on the Timedelta values to get all seconds. Since we want hours, we apply .div(3600).
  • Lastly, reset the index and use df.rename to assign a meaningful column name.
df[['Task Completed','Access Completed']] = \
    df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))

res = df.sort_values('Task Completed').groupby(['ID','Category','Site']).first()
res = res['Task Completed'].sub(res['Access Completed'])\
    .dt.total_seconds().div(3600).reset_index(drop=False).rename(
        columns={0:'Time Difference'})

print(res)

   ID Category Site  Time Difference
0   1        A    X              1.0
1   1        A    Y             24.0
2   1        B    X              1.0
3   2        A    X              1.0

CodePudding user response:

df['Task Completed'] = pd.to_datetime(df['Task Completed'])
df['Access Completed'] = pd.to_datetime(df['Access Completed'])
df


df['time difference']=df.groupby(['ID','Category','Site'])['Task Completed'].transform('min').sub(df['Access Completed'])
df['diff (hrs)']=pd.to_timedelta(df['time difference']).dt.total_seconds()/3600

df['diff (hrs)']=pd.to_timedelta(df['time difference']).dt.total_seconds()/3600
df[['ID', 'Category', 'Site',       'time difference', 'diff (hrs)']]

added difference both in days and in hours

    ID  Category    Site    time difference     diff (hrs)
0    1     A         X       0 days 01:00:00    1.0
1    1     A         Y       1 days 00:00:00    24.0
2    1     A         X       0 days 01:00:00    1.0
3    1     B         X       0 days 01:00:00    1.0
4    2     A         X       0 days 01:00:00    1.0
5    2     A         X       0 days 01:00:00    1.0
  • Related