Loss report, check if user is on the following step and has same CampaignSource, if so display it-CodePudding

Imagine the current dataset

Steps	CampaignSource	set_UserId
One	Pot	a,b,c,d,e
Two	Pot	a,b,c,d
Three	Pot	a,b,c
One	Daddy	e,f,g,h
Two	Daddy	e,f,g
Three	Daddy	e,f

From this dataset you need to build a df, that display the loss users. The ones who are in the first step but not in the second, and so on

Desired result,

Steps	CampaignSource	set_UserId
One	Pot	e
Two	Pot	d
One	Daddy	h
Two	Daddy	g

Things worth noting, when the set of UserIds, differ in CampaignSource. You need to stop so you avoid comparison beetween different CampaignSources.

My attempt at this issue was as follow:

loss_dict = {
    'Steps':[],
    'UserId':[],
    'CampaignSource':[]
}

for a in camp_df['CampaignSource'].unique():
    temp_df = camp_df.loc[camp_df['CampaignSource'] == a].reset_index(drop=True)
    # The dataset is ordered, i just could make it work in ordered manner
    for b,c in enumerate(zip(temp_df['UserId'],temp_df['Steps'],temp_df[['CampaignSource']])):
        try:    
            loss_dict['Steps'].append(c[1])
            loss_dict['UserId'].append(temp_df['UserId'][b]-temp_df['UserId'][b 1])
            loss_dict['CampaignSource'].append(c[2])
        except KeyError:
            continue

Is worth noting i didnt manage to eliminate the last steps, also my approach is a little too convulated, and i am looking for alternatives. Would appreciate some help!

CodePudding user response：

Check if the following could be applied to you problem. First we map the categorical column "Steps" to int. Then we group and shift each of the groups in order to get the desired result:

import pandas as pd

map_order = {"One": 1, "Two": 2, "Three": 3}

df = pd.DataFrame([['One','Pot',set(['a','b','c','d','e'])],
                   ['Two','Pot',set(['a','b','c','d'])],
                   ['Three','Pot',set(['a','b','c'])],
                   ['One','Daddy',set(['e','f','g','h'])],
                   ['Two','Daddy',set(['e','f','g'])],
                   ['Three','Daddy',set(['e','f'])]],
                  columns=['Steps','CampaignSource','set_UserId'])

df["Steps"] = df["Steps"].map(map_order)
df = df.sort_values(['Steps','CampaignSource'])

final_df = []
for _, df_g in df.groupby(["CampaignSource"]):
    df_g["set_UserId_prev"] = df_g.shift(-1)["set_UserId"]
    df_g["diff"] = df_g["set_UserId"] - df_g["set_UserId_prev"]
    final_df.append(df_g)
    
out_df = pd.concat(final_df).dropna()
out_df

Steps   CampaignSource  set_UserId      set_UserId_prev diff
1       Daddy           {g, e, h, f}    {g, e, f}       {h}
2       Daddy           {g, e, f}       {e, f}          {g}
1       Pot             {e, d, a, b, c} {b, d, c, a}    {e}
2       Pot             {b, d, c, a}    {b, c, a}       {d}

CodePudding user response：

Here is one approach to solve it, without the use of FOR loop and taking your dataframe as is, and a very concise (non verbose) solution

df['diff']= df.assign(prev_id  = df['set_UserId'].shift(1) ,
                      prev_src = df['CampaignSource'] == df['CampaignSource'].shift(1)
                     ).apply(lambda row: set(row['prev_id']).difference(set(row['set_UserId']))
                             if row['prev_src'] 
                             else "", axis=1).astype(str).replace(r'{|}|\'','',regex=True)

    Steps   CampaignSource  set_UserId  diff
0   One     Pot             a,b,c,d,e   
1   Two     Pot             a,b,c,d       e
2   Three   Pot             a,b,c         d
3   One     Daddy           e,f,g,h     
4   Two     Daddy           e,f,g         h
5   Three   Daddy           e,f           g

here is how it works, two temporary columns are introduced which essentially are the previous rows values, one is for set_userid and other is to check if CampaignSource has changed. then a set notation is used to find the difference b/w the previous and next value. Since you need the values, and returned result from lambda is a set, parenthesis are removed.