I am trying to come up with a dynamic way to check for the existence of a string and report back a few different results: gone_client, boomerang, new_client.
If I groupby address_id and my_date, and the pattern is Verizon, Verizon, Comcast, Comcast, the client left Verizon and went to another company.
If the client went from Verizon to Comcast and then back to Verizon, this is a boomerang.
If the client is new to Verizon last month, this is a new client, but if the client was with Verizon at some point in the past and new last month, this is a boomerang.
import pandas as pd import numpy as np
# data stored in dictionary details = { 'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777], 'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'], 'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28'] } df = pd.DataFrame(details) df
I can't get the logic quite right, but I think it's something along these lines...but something seems to be off.
def f(ser):
if 'Verizon' not in ser.unique():
return False
if 'Verizon' in ser.unique():
return True
return False
df['gone_client'] = df.groupby('address_id')['my_company'].transform(f)
df.head()
Maybe something like this...
# substring to be searched
sub ='Verizon'
# creating and passing series to new column
df["Indexes"]= df["my_company"].str.find(sub)
df
I'm not sure how to handle the dates though, and only by checking the dates can you know whether a client is gone, boomeranging, new, or maybe some weird patterns which doesn't fit into a bucket, like 'Other'. Expected result would be this (last scenario is kind of a boomerang that left yet again). I wouldn't expect many scenarios like this, but there certainly will be some very unique cases, so 'Other' is fine for these records.
address_id my_company my_date status
0 111 Comcast 1/24/2022 Left
1 111 Verizon 2/21/2022 Left
2 111 Other 3/28/2022 Left
3 111 Other 4/25/2022 Left
4 111 Comcast 5/23/2022 Left
5 111 Comcast 6/27/2022 Left
6 222 Spectrum 1/24/2022 Never Client
7 222 Spectrum 2/21/2022 Never Client
8 222 Spectrum 3/28/2022 Never Client
9 222 Spectrum 4/25/2022 Never Client
10 222 Spectrum 5/23/2022 Never Client
11 222 Spectrum 6/27/2022 Never Client
12 333 Verizon 1/24/2022 Never Left
13 333 Verizon 2/21/2022 Never Left
14 333 Verizon 3/28/2022 Never Left
15 333 Verizon 4/25/2022 Never Left
16 333 Verizon 5/23/2022 Never Left
17 333 Verizon 6/27/2022 Never Left
18 444 Spectrum 1/24/2022 Left
19 444 Spectrum 2/21/2022 Left
20 444 Spectrum 3/28/2022 Left
21 444 Spectrum 4/25/2022 Left
22 444 Verizon 5/23/2022 Left
23 444 Spectrum 6/27/2022 Left
24 555 Verizon 1/24/2022 Boomerang
25 555 Spectrum 2/21/2022 Boomerang
26 555 Spectrum 3/28/2022 Boomerang
27 555 Spectrum 4/25/2022 Boomerang
28 555 Verizon 5/23/2022 Boomerang
29 555 Verizon 6/27/2022 Boomerang
30 777 Comcast 1/24/2022 New Client
31 777 Comcast 2/21/2022 New Client
32 777 Verizon 1/24/2022 New Client
33 888 Verizon 2/21/2022 Other
34 888 Comcast 3/28/2022 Other
35 888 Verizon 4/25/2022 Other
36 888 Comcast 5/23/2022 Other
37 888 Comcast 6/27/2022 Other
CodePudding user response:
Here's a solution with some verbose logic that you can play around with. It doesn't sound like you're quite certain of your final logic but hopefully this gives you enough to play around with.
This solution groups the dataframe based on the address_id
. Then, for each individual group, we can examine which my_company
feature. We can use this, along with a timedelta
to have some logic around whether the address is with Verizon, not with Verizon, never with Verizon, is currently with Verizon and came back in the last 30 days, etc.
This answer is not sponsored by Verizon. Other cellular providers exist.
import pandas as pd
import datetime
# data stored in dictionary
details = {
'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777],
'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'],
'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28']
}
df = pd.DataFrame(details)
df['my_date'] = pd.to_datetime(df['my_date'])
address_groups = df.groupby(['address_id'])
frame_list = []
current_date = datetime.datetime.now()
for group, frame in address_groups:
# Create a list and set of each company used by a given address-id:
company_list = frame['my_company'].values.tolist()
company_set = set(company_list)
# Exclusively Verizon
if ('Verizon' in company_set) and (len(company_set) == 1):
frame['status'] = 'Verizon Diehard'
# Never Verizon
if ('Verizon' not in company_set):
frame['status'] = 'Verizon Never'
# Verizon at some point but not currently
if ('Verizon' in company_set) and (company_list[-1] != 'Verizon'):
v_frame = frame[frame['my_company'] == 'Verizon']
last_verizon_date = v_frame['my_date'].iloc[-1]
last_verizon_date = datetime.datetime.strptime(last_verizon_date, '%Y-%m-%d')
if (current_date - last_verizon_date) < pd.Timedelta("30 days"):
frame['status'] = 'Not curretly Verizon, but was in last 30 days'
else:
frame['status'] = 'Not curretly Verizon, but was so more than 30 days ago'
# Verizon currently but was a boomerang
if (company_list[-1] == 'Verizon') and (len(company_set) >= 2):
non_v_frame = frame[frame['my_company'] != 'Verizon']
last_non_v_date = non_v_frame['my_date'].iloc[-1]
last_non_v_date = datetime.datetime.strptime(last_non_v_date, '%Y-%m-%d')
if (current_date - last_non_v_date) < pd.Timedelta("30 days"):
frame['status'] = 'Boomerang back to Verizon in last 30 days'
else:
frame['status'] = 'Boomerang back more than 30 days ago'
frame_list.append(frame)
final_df = pd.concat(frame_list)