I have a dataframe that looks something like this:
import pandas as pd

df = pd.DataFrame([[1,'A','X','1/2/22 12:00:00AM','1/1/22 12:00:00 AM'],
                   [1,'A','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
                   [1,'A','Y','1/3/22 12:00:00AM','1/2/22 12:00:00 AM'],
                   [1,'B','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
                   [2,'A','X','1/2/22 12:00:00AM','1/1/22 12:00:00 AM'],
                   [2,'A','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM']],
                  columns = ['ID','Category','Site','Task Completed','Access Completed'])
ID | Category | Site | Task Completed | Access Completed |
---|---|---|---|---|
1 | A | X | 1/2/22 12:00:00AM | 1/1/22 12:00:00 AM |
1 | A | Y | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM |
1 | A | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
1 | B | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1/2/22 12:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
Quick note: the Access Completed date is the same for every ID/Category/Site combination, no matter how many instances of it there are.
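That assumption could be checked with something like the sketch below. It rebuilds the sample frame from above so it runs on its own, counts the distinct Access Completed values in each ID/Category/Site group, and confirms there is never more than one. The variable name `n_access` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 'A', 'X', '1/2/22 12:00:00AM', '1/1/22 12:00:00 AM'],
                   [1, 'A', 'X', '1/1/22 1:00:00AM', '1/1/22 12:00:00 AM'],
                   [1, 'A', 'Y', '1/3/22 12:00:00AM', '1/2/22 12:00:00 AM'],
                   [1, 'B', 'X', '1/1/22 1:00:00AM', '1/1/22 12:00:00 AM'],
                   [2, 'A', 'X', '1/2/22 12:00:00AM', '1/1/22 12:00:00 AM'],
                   [2, 'A', 'X', '1/1/22 1:00:00AM', '1/1/22 12:00:00 AM']],
                  columns=['ID', 'Category', 'Site', 'Task Completed', 'Access Completed'])

# Number of distinct Access Completed values in each ID/Category/Site group
n_access = df.groupby(['ID', 'Category', 'Site'])['Access Completed'].nunique()

# True only when every group has exactly one access date
print((n_access == 1).all())
```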
What I want to find is the time difference (in hours) between Access Completed and the first Task Completed for every ID/Category/Site combination within the dataset. I also want to include that first Task Completed date and the Access Completed date alongside the result.
I am able to get the time difference calculation but I'm not sure how to tie in the first task completed date and the Access completed date for each of the ID/Category/Site combos. Here's what I have so far:
df[['Task Completed','Access Completed']] = \
    df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))
res = df.sort_values('Task Completed').groupby(['ID','Category','Site']).first()
res = res['Task Completed'].sub(res['Access Completed'])\
    .dt.total_seconds().div(3600).reset_index(drop=False).rename(
        columns={0:'Time Difference'})
This has an output of:
ID Category Site Time Difference
0 1 A X 1.0
1 1 A Y 24.0
2 1 B X 1.0
3 2 A X 1.0
This is my intended result:
ID | Category | Site | Time Difference | First Task Completed | Access Completed |
---|---|---|---|---|---|
1 | A | X | 1 | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
1 | A | Y | 24 | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM |
1 | B | X | 1 | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1 | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
Thanks in advance for your help.
Answer:
Process used:
- Convert the date columns to datetime so they can be subtracted.
- Sort by Task Completed and drop the duplicate rows, keeping only the earliest task for each ID/Category/Site combination.
- With the duplicates removed, calculate the time difference in hours.
- Re-order, rename and format the columns to match the expected output.
import pandas as pd
cols = ['ID','Category','Site','Task Completed','Access Completed']
df = pd.DataFrame([[1,'A','X','1/2/22 12:00:00AM','1/1/22 12:00:00 AM'],
[1,'A','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
[1,'A','Y','1/3/22 12:00:00AM','1/2/22 12:00:00 AM'],
[1,'B','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
[2,'A','X','1/2/22 12:00:00AM','1/1/22 12:00:00 AM'],
[2,'A','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM']],
columns = cols)
#Convert to datetime
df[['Task Completed','Access Completed']] = df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))
# Remove duplicate rows - only keep the first task completed per ID/Category/Site.
res = df.sort_values('Task Completed')\
.drop_duplicates(subset=["ID", "Category", 'Site'], keep='first')\
.sort_index()
# Calculate time difference
res['Time Difference'] = res['Task Completed'].sub(res['Access Completed']).dt.total_seconds().div(3600)
#Re-order and re-name columns
cols.insert(3,'Time Difference')
res = res[cols].rename(columns={"Task Completed": "First Task Completed"})
# Convert the dates back to strings (%I is the 12-hour clock that pairs with %p)
res["First Task Completed"] = res["First Task Completed"].dt.strftime('%m/%d/%Y %I:%M:%S %p')
res["Access Completed"] = res["Access Completed"].dt.strftime('%m/%d/%Y %I:%M:%S %p')
print(res)
OUTPUT:
ID Category Site Time Difference First Task Completed Access Completed
1 1 A X 1.0 01/01/2022 01:00:00 AM 01/01/2022 12:00:00 AM
2 1 A Y 24.0 01/03/2022 12:00:00 AM 01/02/2022 12:00:00 AM
3 1 B X 1.0 01/01/2022 01:00:00 AM 01/01/2022 12:00:00 AM
5 2 A X 1.0 01/01/2022 01:00:00 AM 01/01/2022 12:00:00 AM
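As an alternative, the same table could probably be built with a groupby and named aggregation instead of sorting and dropping duplicates. This is only a sketch, not the approach used above. It assumes the frame and column names from the question, and it relies on the question's note that Access Completed is identical within each ID/Category/Site combination, so taking `first` for that column is safe. The dates are left as datetimes here rather than formatted back to strings.

```python
import pandas as pd

cols = ['ID', 'Category', 'Site', 'Task Completed', 'Access Completed']
df = pd.DataFrame([[1, 'A', 'X', '1/2/22 12:00:00AM', '1/1/22 12:00:00 AM'],
                   [1, 'A', 'X', '1/1/22 1:00:00AM', '1/1/22 12:00:00 AM'],
                   [1, 'A', 'Y', '1/3/22 12:00:00AM', '1/2/22 12:00:00 AM'],
                   [1, 'B', 'X', '1/1/22 1:00:00AM', '1/1/22 12:00:00 AM'],
                   [2, 'A', 'X', '1/2/22 12:00:00AM', '1/1/22 12:00:00 AM'],
                   [2, 'A', 'X', '1/1/22 1:00:00AM', '1/1/22 12:00:00 AM']],
                  columns=cols)

# Parse both date columns so they can be compared and subtracted
for col in ['Task Completed', 'Access Completed']:
    df[col] = pd.to_datetime(df[col])

# One row per combination: earliest task and the (shared) access date.
# The dict is unpacked because the output column names contain spaces.
res = (df.groupby(['ID', 'Category', 'Site'], as_index=False)
         .agg(**{'First Task Completed': ('Task Completed', 'min'),
                 'Access Completed': ('Access Completed', 'first')}))

# Time difference in hours between first task and access
res['Time Difference'] = (
    (res['First Task Completed'] - res['Access Completed'])
    .dt.total_seconds() / 3600
)

# Column order from the intended result in the question
res = res[['ID', 'Category', 'Site', 'Time Difference',
           'First Task Completed', 'Access Completed']]
print(res)
```

Using `min` rather than sorting means the original row order does not matter, and the result gets a fresh 0..n-1 index instead of the original row labels.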