Imagine the current dataset
Steps | CampaignSource | set_UserId |
---|---|---|
One | Pot | a,b,c,d,e |
Two | Pot | a,b,c,d |
Three | Pot | a,b,c |
One | Daddy | e,f,g,h |
Two | Daddy | e,f,g |
Three | Daddy | e,f |
From this dataset you need to build a df, that display the loss users. The ones who are in the first step but not in the second, and so on
Desired result,
Steps | CampaignSource | set_UserId |
---|---|---|
One | Pot | e |
Two | Pot | d |
One | Daddy | h |
Two | Daddy | g |
Things worth noting, when the set of UserIds, differ in CampaignSource. You need to stop so you avoid comparison beetween different CampaignSources.
My attempt at this issue was as follow:
loss_dict = {
'Steps':[],
'UserId':[],
'CampaignSource':[]
}
for a in camp_df['CampaignSource'].unique():
temp_df = camp_df.loc[camp_df['CampaignSource'] == a].reset_index(drop=True)
# The dataset is ordered, i just could make it work in ordered manner
for b,c in enumerate(zip(temp_df['UserId'],temp_df['Steps'],temp_df[['CampaignSource']])):
try:
loss_dict['Steps'].append(c[1])
loss_dict['UserId'].append(temp_df['UserId'][b]-temp_df['UserId'][b 1])
loss_dict['CampaignSource'].append(c[2])
except KeyError:
continue
Is worth noting i didnt manage to eliminate the last steps, also my approach is a little too convulated, and i am looking for alternatives. Would appreciate some help!
CodePudding user response:
Check if the following could be applied to you problem. First we map the categorical column "Steps" to int. Then we group and shift each of the groups in order to get the desired result:
import pandas as pd
map_order = {"One": 1, "Two": 2, "Three": 3}
df = pd.DataFrame([['One','Pot',set(['a','b','c','d','e'])],
['Two','Pot',set(['a','b','c','d'])],
['Three','Pot',set(['a','b','c'])],
['One','Daddy',set(['e','f','g','h'])],
['Two','Daddy',set(['e','f','g'])],
['Three','Daddy',set(['e','f'])]],
columns=['Steps','CampaignSource','set_UserId'])
df["Steps"] = df["Steps"].map(map_order)
df = df.sort_values(['Steps','CampaignSource'])
final_df = []
for _, df_g in df.groupby(["CampaignSource"]):
df_g["set_UserId_prev"] = df_g.shift(-1)["set_UserId"]
df_g["diff"] = df_g["set_UserId"] - df_g["set_UserId_prev"]
final_df.append(df_g)
out_df = pd.concat(final_df).dropna()
out_df
Steps CampaignSource set_UserId set_UserId_prev diff
1 Daddy {g, e, h, f} {g, e, f} {h}
2 Daddy {g, e, f} {e, f} {g}
1 Pot {e, d, a, b, c} {b, d, c, a} {e}
2 Pot {b, d, c, a} {b, c, a} {d}
CodePudding user response:
Here is one approach to solve it, without the use of FOR loop and taking your dataframe as is, and a very concise (non verbose) solution
df['diff']= df.assign(prev_id = df['set_UserId'].shift(1) ,
prev_src = df['CampaignSource'] == df['CampaignSource'].shift(1)
).apply(lambda row: set(row['prev_id']).difference(set(row['set_UserId']))
if row['prev_src']
else "", axis=1).astype(str).replace(r'{|}|\'','',regex=True)
Steps CampaignSource set_UserId diff
0 One Pot a,b,c,d,e
1 Two Pot a,b,c,d e
2 Three Pot a,b,c d
3 One Daddy e,f,g,h
4 Two Daddy e,f,g h
5 Three Daddy e,f g
here is how it works, two temporary columns are introduced which essentially are the previous rows values, one is for set_userid and other is to check if CampaignSource has changed. then a set notation is used to find the difference b/w the previous and next value. Since you need the values, and returned result from lambda is a set, parenthesis are removed.