How to write a function to find clients that are gone, boomeranging, new, etc?-CodePudding

I am trying to come up with a dynamic way to check for the existence of a string and report back a few different results: gone_client, boomerang, new_client.

If I groupby address_id and my_date, and the pattern is Verizon, Verizon, Comcast, Comcast, the client left Verizon and went to another company.
If the client went from Verizon to Comcast and then back to Verizon, this is a boomerang.

If the client is new to Verizon last month, this is a new client, but if the client was with Verizon at some point in the past and new last month, this is a boomerang.

import pandas as pd import numpy as np

# data stored in dictionary
details = {
    'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777],
    'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'],
    'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28']
}

df = pd.DataFrame(details)
df

I can't get the logic quite right, but I think it's something along these lines...but something seems to be off.

def f(ser):
    if 'Verizon' not in ser.unique():
        return False
    if 'Verizon' in ser.unique():
        return True
    return False

df['gone_client'] = df.groupby('address_id')['my_company'].transform(f)
df.head()

Maybe something like this...

# substring to be searched
sub ='Verizon'
 
# creating and passing series to new column
df["Indexes"]= df["my_company"].str.find(sub)
df

I'm not sure how to handle the dates though, and only by checking the dates can you know whether a client is gone, boomeranging, new, or maybe some weird patterns which doesn't fit into a bucket, like 'Other'. Expected result would be this (last scenario is kind of a boomerang that left yet again). I wouldn't expect many scenarios like this, but there certainly will be some very unique cases, so 'Other' is fine for these records.

    address_id      my_company  my_date     status
0   111             Comcast     1/24/2022   Left
1   111             Verizon     2/21/2022   Left
2   111             Other       3/28/2022   Left
3   111             Other       4/25/2022   Left
4   111             Comcast     5/23/2022   Left
5   111             Comcast     6/27/2022   Left
6   222             Spectrum    1/24/2022   Never Client
7   222             Spectrum    2/21/2022   Never Client
8   222             Spectrum    3/28/2022   Never Client
9   222             Spectrum    4/25/2022   Never Client
10  222             Spectrum    5/23/2022   Never Client
11  222             Spectrum    6/27/2022   Never Client
12  333             Verizon     1/24/2022   Never Left
13  333             Verizon     2/21/2022   Never Left
14  333             Verizon     3/28/2022   Never Left
15  333             Verizon     4/25/2022   Never Left
16  333             Verizon     5/23/2022   Never Left
17  333             Verizon     6/27/2022   Never Left
18  444             Spectrum    1/24/2022   Left
19  444             Spectrum    2/21/2022   Left
20  444             Spectrum    3/28/2022   Left
21  444             Spectrum    4/25/2022   Left
22  444             Verizon     5/23/2022   Left
23  444             Spectrum    6/27/2022   Left
24  555             Verizon     1/24/2022   Boomerang
25  555             Spectrum    2/21/2022   Boomerang
26  555             Spectrum    3/28/2022   Boomerang
27  555             Spectrum    4/25/2022   Boomerang
28  555             Verizon     5/23/2022   Boomerang
29  555             Verizon     6/27/2022   Boomerang
30  777             Comcast     1/24/2022   New Client
31  777             Comcast     2/21/2022   New Client
32  777             Verizon     1/24/2022   New Client
33  888             Verizon     2/21/2022   Other
34  888             Comcast     3/28/2022   Other
35  888             Verizon     4/25/2022   Other
36  888             Comcast     5/23/2022   Other
37  888             Comcast     6/27/2022   Other

CodePudding user response：

Here's a solution with some verbose logic that you can play around with. It doesn't sound like you're quite certain of your final logic but hopefully this gives you enough to play around with.

This solution groups the dataframe based on the address_id. Then, for each individual group, we can examine which my_company feature. We can use this, along with a timedelta to have some logic around whether the address is with Verizon, not with Verizon, never with Verizon, is currently with Verizon and came back in the last 30 days, etc.

This answer is not sponsored by Verizon. Other cellular providers exist.

import pandas as pd
import datetime

# data stored in dictionary
details = {
    'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777],
    'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'],
    'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28']
}

df = pd.DataFrame(details)

df['my_date'] = pd.to_datetime(df['my_date']) 

address_groups = df.groupby(['address_id'])

frame_list = []

current_date = datetime.datetime.now()

for group, frame in address_groups:

    # Create a list and set of each company used by a given address-id:
    company_list = frame['my_company'].values.tolist()
    company_set = set(company_list)

    # Exclusively Verizon
    if ('Verizon' in company_set) and (len(company_set) == 1):
        frame['status'] = 'Verizon Diehard'

    # Never Verizon
    if ('Verizon' not in company_set):
        frame['status'] = 'Verizon Never'

    # Verizon at some point but not currently
    if ('Verizon' in company_set) and (company_list[-1] != 'Verizon'):

        v_frame = frame[frame['my_company'] == 'Verizon']
        last_verizon_date = v_frame['my_date'].iloc[-1]
        last_verizon_date = datetime.datetime.strptime(last_verizon_date, '%Y-%m-%d')
        if (current_date - last_verizon_date) < pd.Timedelta("30 days"):
            frame['status'] = 'Not curretly Verizon, but was in last 30 days'
        else:
            frame['status'] = 'Not curretly Verizon, but was so more than 30 days ago'
 
    # Verizon currently but was a boomerang
    if (company_list[-1] == 'Verizon') and (len(company_set) >= 2):

        non_v_frame = frame[frame['my_company'] != 'Verizon']
        last_non_v_date = non_v_frame['my_date'].iloc[-1]
        last_non_v_date = datetime.datetime.strptime(last_non_v_date, '%Y-%m-%d')

        if (current_date - last_non_v_date) < pd.Timedelta("30 days"):
            frame['status'] = 'Boomerang back to Verizon in last 30 days'
        else:
            frame['status'] = 'Boomerang back more than 30 days ago'


    frame_list.append(frame)

final_df = pd.concat(frame_list)