I have a dataframe that looks something like this:
import pandas as pd

df = pd.DataFrame(
[[1,'A','X','1/2/22 12:00:00AM'],
[1,'A','X','1/3/22 12:00:00AM'],
[1,'A','X','1/1/22 12:00:00AM'],
[1,'A','X','1/2/22 1:00:00AM'],
[1,'B','Y','1/1/22 1:00:00AM'],
[2,'A','Z','1/2/22 12:00:00AM'],
[2,'A','Z','1/1/22 12:00:00AM']],
columns=['ID', 'Category', 'Site', 'Task Completed'])
ID | Category | Site | Task Completed |
---|---|---|---|
1 | A | X | 1/2/22 12:00:00AM |
1 | A | X | 1/3/22 12:00:00AM |
1 | A | X | 1/1/22 12:00:00AM |
1 | A | X | 1/2/22 1:00:00AM |
1 | B | Y | 1/1/22 1:00:00AM |
2 | A | Z | 1/2/22 12:00:00AM |
2 | A | Z | 1/1/22 12:00:00AM |
As you can see, there can be multiple Task Completed dates for an ID/Category/Site combo.
What I want to find is the time difference (in days) between the first (min) and the last (max) Task Completed date for every ID/Category/Site combination in the dataset. I also want the number of instances for each ID/Category/Site combo. The intended result would look something like this:
ID | Category | Site | Time Difference | # of instances |
---|---|---|---|---|
1 | A | X | 2 | 4 |
1 | B | Y | 0 | 1 |
2 | A | Z | 1 | 2 |
So far, I know how to get the time difference and the value counts separately:
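# parse the timestamps and keep only the calendar date
df['Task Completed'] = pd.to_datetime(df['Task Completed'], utc=True).dt.date
# first and last completion date per ID/Category/Site
result = df.groupby(['ID', 'Category', 'Site'])['Task Completed'].agg(['max','min'])
result['diff'] = result['max']-result['min']
# occurrences of each date within each ID/Category/Site group
values = df.groupby(['ID', 'Category', 'Site'])['Task Completed'].value_counts()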
But I'm not sure how to get the value counts and time differences together.
Thanks so much for your help.
CodePudding user response:
Try:
# convert the "Task Completed" column to datetime:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)
x = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
**{
"Time Difference": (
"Task Completed",
lambda x: (x.max() - x.min()).days,
),
"# of instances": ("Task Completed", "count"),
}
)
print(x)
Prints:
ID Category Site Time Difference # of instances
0 1 A X 2 4
1 1 B Y 0 1
2 2 A Z 1 2
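One detail to keep in mind with this approach: `.days` on a timedelta counts only complete 24-hour periods, so two timestamps on consecutive dates that are less than 24 hours apart give 0. If calendar-day differences are what is wanted (as the `.date()` conversion in the question suggests), a minimal sketch is to normalize the timestamps to midnight before subtracting. The rows and the names `demo`, `elapsed_days` and `calendar_days` below are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical rows: two timestamps 2 hours apart that fall on different dates.
demo = pd.DataFrame(
    {
        "ID": [1, 1],
        "Category": ["A", "A"],
        "Site": ["X", "X"],
        "Task Completed": ["2022-01-01 23:00:00", "2022-01-02 01:00:00"],
    }
)
demo["Task Completed"] = pd.to_datetime(demo["Task Completed"])

keys = ["ID", "Category", "Site"]

# Elapsed time: .dt.days keeps only complete 24-hour periods.
grouped = demo.groupby(keys)["Task Completed"]
elapsed_days = (grouped.max() - grouped.min()).dt.days

# Calendar days: drop the time of day first, then subtract.
normalized = demo.assign(**{"Task Completed": demo["Task Completed"].dt.normalize()})
cal_grouped = normalized.groupby(keys)["Task Completed"]
calendar_days = (cal_grouped.max() - cal_grouped.min()).dt.days

print(elapsed_days)
print(calendar_days)
```

For the sample data in the question both variants give the same numbers, because every group's first and last timestamps share the same time of day.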
CodePudding user response:
df = pd.DataFrame(
[[1,'A','X','1/2/22 12:00:00AM'],
[1,'A','X','1/3/22 12:00:00AM'],
[1,'A','X','1/1/22 12:00:00AM'],
[1,'A','X','1/2/22 1:00:00AM'],
[1,'B','Y','1/1/22 1:00:00AM'],
[2,'A','Z','1/2/22 12:00:00AM'],
[2,'A','Z','1/1/22 12:00:00AM']],
columns=['ID', 'Category', 'Site', 'Task Completed'])
df['Task Completed'] = pd.to_datetime(df['Task Completed'])
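# caveat: day-of-year values are only comparable when all dates in a group fall within the same year
df['Task Completed'] = df['Task Completed'].apply(lambda x: x.day_of_year)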
df_res = df.groupby(['ID','Category','Site'], as_index=False)['Task Completed'].apply(lambda x : x.max()-x.min())
df_res.rename(columns = {'Task Completed':'Day Diff'}, inplace = True)
df_res
df_res returns:
ID Category Site Day Diff
0 1 A X 2
1 1 B Y 0
2 2 A Z 1
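The day-of-year subtraction used here works for the sample data because every date is in January 2022. As a hedged illustration of where it stops working, the sketch below compares it with plain timestamp subtraction on two hypothetical dates (not from the question) that straddle a year boundary.

```python
import pandas as pd

# Hypothetical timestamps one calendar day apart, on either side of New Year.
s = pd.to_datetime(pd.Series(["2021-12-31", "2022-01-01"]))

# Day-of-year approach: the values are 365 and 1, so max - min is 364.
doy = s.dt.dayofyear
doy_diff = int(doy.max() - doy.min())

# Timestamp approach: subtracting the timestamps gives a Timedelta of one day.
ts_diff = (s.max() - s.min()).days

print(doy_diff, ts_diff)
```

Subtracting the timestamps themselves and taking `.days` avoids this issue, at the cost of keeping the column as datetimes rather than integers.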
CodePudding user response:
pandas' groupby is lazy; this means you can reuse it multiple times after creating it:
df["Task Completed"] = pd.to_datetime(df["Task Completed"], dayfirst=False)
out = df.groupby(['ID', 'Category', 'Site'])['Task Completed']
(out
.agg(['size']) # use a list so that a DataFrame is returned
.assign(time_difference = out.max().sub(out.min()).dt.days)
.rename(columns={'size':'# of instances'})
)
# of instances time_difference
ID Category Site
1 A X 4 2
B Y 1 0
2 A Z 2 1
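The result above keeps ID/Category/Site as a MultiIndex. If a flat table with the column names and order from the question is preferred, one possible sketch (assuming the sample data from the question; the names `g` and `flat` are illustrative) is to reuse the same grouped object and finish with `reset_index`:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'A', 'X', '1/2/22 12:00:00AM'],
     [1, 'A', 'X', '1/3/22 12:00:00AM'],
     [1, 'A', 'X', '1/1/22 12:00:00AM'],
     [1, 'A', 'X', '1/2/22 1:00:00AM'],
     [1, 'B', 'Y', '1/1/22 1:00:00AM'],
     [2, 'A', 'Z', '1/2/22 12:00:00AM'],
     [2, 'A', 'Z', '1/1/22 12:00:00AM']],
    columns=['ID', 'Category', 'Site', 'Task Completed'])

df["Task Completed"] = pd.to_datetime(df["Task Completed"])

# Create the grouped object once and reuse it for size, max and min.
g = df.groupby(["ID", "Category", "Site"])["Task Completed"]

flat = (
    g.agg(["size"])  # list form so a DataFrame is returned
    .assign(**{"Time Difference": (g.max() - g.min()).dt.days})
    .rename(columns={"size": "# of instances"})
    .reset_index()  # move ID/Category/Site back into regular columns
    [["ID", "Category", "Site", "Time Difference", "# of instances"]]
)

print(flat)
```

The `**{...}` form in `assign` is only there because the target column name contains a space and so cannot be written as a keyword argument.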