I have a dataframe that looks something like this (actual dataframe is millions of rows):
ID | Category | Site | Task Completed | Access Completed |
---|---|---|---|---|
1 | A | X | 1/2/22 12:00:00AM | 1/1/22 12:00:00 AM |
1 | A | Y | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM |
1 | A | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
1 | B | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1/2/22 12:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
A few things to point out about this dataframe. First, the 'access completed' date is constant for a ID/category/site combination. So, for ID 1, Cat A, Site X, the access completed will always be 1/1/22 12:00:00 AM no matter how many 'task completed' there are for that combo. Second, there can be multiple task completed dates for a ID/Category/Site combination.
What I want to find is the time difference (in hours) between Access Completed and first Task Completed for every ID/Category/Site combination within the dataset. The intended result would look something like this:
ID | Category | Site | Time Difference |
---|---|---|---|
1 | A | X | 1 |
1 | A | Y | 24 |
1 | B | X | 1 |
2 | A | X | 1 |
Thanks so much for your help.
CodePudding user response:
Try as follows:
- First, use
pd.to_datetime
to turn date/time strings into datetime. - Sort the
df
on column "Task Completed", applydf.groupby
and get.first
. - Now, subtract "Access Completed" from "Task Completed" using
Series.sub
, and use.dt.total_seconds
on the Timedelta values to get all seconds. Since we want hours, we apply.div(3600)
. - Lastly, reset the index and use
df.rename
to assign a meaningful column name.
df[['Task Completed','Access Completed']] = \
df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))
res = df.sort_values('Task Completed').groupby(['ID','Category','Site']).first()
res = res['Task Completed'].sub(res['Access Completed'])\
.dt.total_seconds().div(3600).reset_index(drop=False).rename(
columns={0:'Time Difference'})
print(res)
ID Category Site Time Difference
0 1 A X 1.0
1 1 A Y 24.0
2 1 B X 1.0
3 2 A X 1.0
CodePudding user response:
df['Task Completed'] = pd.to_datetime(df['Task Completed'])
df['Access Completed'] = pd.to_datetime(df['Access Completed'])
df
df['time difference']=df.groupby(['ID','Category','Site'])['Task Completed'].transform('min').sub(df['Access Completed'])
df['diff (hrs)']=pd.to_timedelta(df['time difference']).dt.total_seconds()/3600
df['diff (hrs)']=pd.to_timedelta(df['time difference']).dt.total_seconds()/3600
df[['ID', 'Category', 'Site', 'time difference', 'diff (hrs)']]
added difference both in days and in hours
ID Category Site time difference diff (hrs)
0 1 A X 0 days 01:00:00 1.0
1 1 A Y 1 days 00:00:00 24.0
2 1 A X 0 days 01:00:00 1.0
3 1 B X 0 days 01:00:00 1.0
4 2 A X 0 days 01:00:00 1.0
5 2 A X 0 days 01:00:00 1.0