I have got a large DF that contains some sales opportunities. this opps change stage several times during their lifecycle and we can see what those changes are and when were they made. The possible stages are:
Closed Won
Closed No Deal
Propose
Negotiate
Qualify
Closed Lost
Invalid
Identify
Implemented
Close Lost
Close Won
Close No Deal
I also know the OldValue before the opp was changed. What i need to do is to have a separate DF where i get the median value of how long an opportunity stayed at each stage. I have researched the web and SO and i cannot find a solution that meets my criteria.
Here is some sample data:
reatedDate OpportunityId OldValue NewValue
2020-05-05T12:04:32.000Z 0060N00000TbLneQAF Propose Qualify
2020-07-06T08:44:08.000Z 0060N00000TbLneQAF Qualify Identify
2020-08-05T08:59:45.000Z 0060N00000TbLneQAF Identify Qualify
2020-08-05T12:02:59.000Z 0060N00000TbLneQAF Qualify Propose
2020-09-22T06:47:16.000Z 0060N00000TbLneQAF Propose Qualify
2020-10-08T15:33:29.000Z 0060N00000TbLneQAF Qualify Identify
2020-10-08T15:40:21.000Z 0060N00000TbLneQAF Identify Closed No Deal
2021-07-29T07:57:28.000Z 0060N00000TbLohQAF Identify Closed No Deal
2021-10-17T03:07:24.000Z 0060N00000TbLtwQAF Qualify Closed No Deal
2021-07-27T13:57:34.000Z 0060N00000TbMhkQAF Identify Closed No Deal
2020-04-22T13:35:30.000Z 0060N00000TbMkjQAF Negotiate Closed Lost
2020-09-25T09:37:32.000Z 0060N00000TbN8qQAF Qualify Propose
2020-09-25T09:37:41.000Z 0060N00000TbN8qQAF Propose Negotiate
2021-09-06T14:31:05.000Z 0060N00000TbN8qQAF Negotiate Propose
2021-11-03T11:09:56.000Z 0060N00000TbNF8QAN Identify Qualify
2020-04-29T15:43:58.000Z 0060N00000TbNFSQA3 Identify Invalid
2021-01-07T09:35:56.000Z 0060N00000TbNUDQA3 Identify Closed No Deal
2020-12-03T08:53:12.000Z 0060N00000TbNUSQA3 Qualify Identify
2021-09-08T09:41:54.000Z 0060N00000TbNUSQA3 Identify Closed Lost
2021-04-14T07:31:49.000Z 0060N00000TbNg4QAF Identify Closed No Deal
2020-04-27T12:19:51.000Z 0060N00000TbNwCQAV Qualify Identify
2020-05-04T11:15:00.000Z 0060N00000TbNxPQAV Identify Closed No Deal
2021-05-24T03:13:10.000Z 0060N00000TbNywQAF Qualify Closed No Deal
2021-05-28T14:51:32.000Z 0060N00000TbO3SQAV Identify Invalid
2021-07-27T13:25:50.000Z 0060N00000TbOBlQAN Identify Closed No Deal
2021-07-27T13:25:50.000Z 0060N00000TbOCeQAN Identify Closed No Deal
2021-07-27T13:25:50.000Z 0060N00000TbOELQA3 Identify Closed No Deal
2020-04-28T15:12:53.000Z 0060N00000TbOIrQAN Qualify Negotiate
2020-05-18T14:11:18.000Z 0060N00000TbOIrQAN Negotiate Closed Won
2021-07-27T13:22:09.000Z 0060N00000TbOJzQAN Identify Closed No Deal
2021-07-27T13:22:09.000Z 0060N00000TbOLbQAN Identify Closed No Deal
2020-08-13T05:04:36.000Z 0060N00000TbOOzQAN Propose Identify
2020-09-06T15:36:36.000Z 0060N00000TbOOzQAN Identify Invalid
2021-05-27T14:22:10.000Z 0060N00000TbOWKQA3 Qualify Identify
2021-05-27T14:22:27.000Z 0060N00000TbOWKQA3 Identify Closed Lost
2020-04-27T12:25:52.000Z 0060N00000TbOX3QAN Qualify Identify
2021-01-08T15:27:33.000Z 0060N00000TbOX3QAN Identify Qualify
2020-04-13T10:53:57.000Z 0060N00000TbOY6QAN Qualify Identify
2020-12-03T10:38:35.000Z 0060N00000TbOY6QAN Identify Closed Lost
2020-04-13T10:54:57.000Z 0060N00000TbOYaQAN Qualify Identify
2020-12-01T10:50:41.000Z 0060N00000TbOYaQAN Identify Closed No Deal
2021-07-27T13:57:34.000Z 0060N00000TbOb5QAF Identify Closed No Deal
2021-07-27T13:57:34.000Z 0060N00000TbOeYQAV Identify Closed No Deal
2020-05-29T12:28:44.000Z 0060N00000TbOgyQAF Identify Qualify
2020-12-18T07:34:18.000Z 0060N00000TbOgyQAF Qualify Identify
2020-12-18T07:34:43.000Z 0060N00000TbOgyQAF Identify Invalid
2021-07-27T13:22:09.000Z 0060N00000TbOhSQAV Identify Closed No Deal
2020-04-15T11:30:09.000Z 0060N00000TbOk7QAF Identify Invalid
2020-08-26T03:16:46.000Z 0060N00000TbOnVQAV Qualify Closed No Deal
2020-04-03T13:08:23.000Z 0060N00000TbOy4QAF Identify Closed Lost
what i have done so far is try to use the group_by function and then try to get the median but i get stuck because i need to be looking at the dates of when the opp was changed and compare that to the time it took to get the new value. I would be looking at the below outcome:
Stage Average Time
Closed Won x days
Closed No Deal x days
Propose x days
Negotiate x days
Qualify x days
Closed Lost x days
Invalid x days
Identify x days
Implemented x days
Close Lost x days
Close Won x days
Close No Deal x days
Hope that makes sense. i am a total noob by the way so thank you for helping!
CodePudding user response:
This worked for me, but you forgot to mention wich "OldValue" represents the creation of an OPP.
df['difference'] = df.groupby('OpportunityId').reatedDate.diff()
df['aux'] = df['OldValue'] ' - ' df['NewValue']
df['days_diff'] = df['difference'].dt.days
df.groupby('aux')['days_diff'].mean()
Output:
aux days_difference
Identify - Closed Lost 170.666667
Identify - Closed No Deal 115.500000
Identify - Invalid 12.000000
Identify - Qualify 143.000000
Negotiate - Closed Lost NaN
Negotiate - Closed Won 19.000000
Negotiate - Propose 346.000000
Propose - Identify NaN
Propose - Negotiate 0.000000
Propose - Qualify 47.000000
Qualify - Closed No Deal NaN
Qualify - Identify 93.000000
Qualify - Negotiate NaN
Qualify - Propose 0.000000
CodePudding user response:
I think a possible solution would be to first sort the dataframe by OpportunityId
and reatedDate
:
sorted_df = df.sort_values(by=['OpportunityId','reatedDate'])
Then you can get the row wise difference:
sorted_df['time_diff'] = sorted_df.reatedDate.diff()
Also you need to identify the cases where there is a transition in the ID. As those need to be excluded from the statistics:
sorted_df['id_diff'] = sorted_df.OpportunityId.diff()
filtered_df=sorted_df[sorted_df.id_diff==0]
Then you should be able to do your stats:
filtered_df.groupby(['OldValue'])[['time_diff']].median()
If you need only the days you might add a .dt.days