I have hundreds of project entries that I am trying to write a function to calculate the time spent in a project stage during a given month only.
I have an entry for the project every time there is a project stage change. The clock starts at the beginning of the month or when the StageID is 2. And ends at the end of the month or when the StageID is 3.
How do I calculate the time a project spends in StageID 2 between the specified start_date and end_date using the dataframe below?
start_date = pd.to_datetime('2021-07-01 00:00:00.000')
end_date = pd.to_datetime('2021-07-31 23:59:59.999')
df = pd.DataFrame({'ProjectID': [1,1,2,2],
'StageID' : [2,3,2,3,],
'Date' : ["2021-7-2", "2021-7-28", "2021-7-15", "2021-8-22"]})
I expect the result to look like:
ProjectID DaysPassed
0 1 26
1 2 16
Bonus questions:
df = pd.DataFrame({'ProjectID': [1,1,1,1,2,2],
'StageID' : [2,3,2,3,2,3,],
'Date' : ["2021-7-5", "2021-7-10","2021-7-15","2021-7-20", "2021-7-15", "2021-8-22"]})
expect the result to look like:
ProjectID DaysPassed
0 1 10
1 2 16
CodePudding user response:
IIUC, you want:
output = df[df["StageID"].eq(2)].drop("StageID", axis=1).merge(df[df["StageID"].eq(3)].drop("StageID", axis=1), on="ProjectID")
output["DaysPassed"] = output["Date_y"].clip(upper=pd.Timestamp(2021,7,31)).sub(output["Date_x"])
output = output.drop(["Date_x", "Date_y"],axis=1)
>>> df
ProjectID DaysPassed
0 1 26 days
1 2 16 days
If your projects can have multiple iterations, you can create a new column to identify the project uniquely:
df = pd.DataFrame({'ProjectID': [1,1,1,1,2,2],
'StageID' : [2,3,2,3,2,3,],
'Date' : ["2021-7-5", "2021-7-10","2021-7-15","2021-7-20", "2021-7-15", "2021-8-22"]})
df["Date"] = pd.to_datetime(df["Date"])
df["Iteration"] = df.groupby(["ProjectID", "StageID"])["Date"].cumcount() 1
output = df[df["StageID"].eq(2)].drop("StageID", axis=1).merge(df[df["StageID"].eq(3)].drop("StageID", axis=1), on=["ProjectID", "Iteration"])
output["DaysPassed"] = output["Date_y"].clip(upper=pd.Timestamp(2021,7,31)).sub(output["Date_x"]).dt.output = output.drop(["Date_x", "Date_y"],axis=1)
>>> output
ProjectID Iteration DaysPassed
0 1 1 5
1 1 2 5
2 2 1 16