Context:
So I have a multi-index df that contains data on how users have traversed through our website. The df is a as follows:
df = df.loc[df['eventType'] == 5]
df.set_index(['id_1','id_2']).sort_values('timestamp').head()
sample df:
eventType timestamp Step_A Step_B Step_C Step_D
id id_2
1 abc 5 t_1 Action_A NA NA NA
abc 5 t_2 Action_A Action_B NA NA
abc 5 t_3 Action_A Action_B Action_C NA
abc 5 t_4 Action_A Action_B Action_C Exit Website
2 ijk 5 t_4 Action_A NA NA NA
ijk 5 t_5 Action_A Action_B NA NA
ijk 5 t_6 Action_A NA NA NA
3 . . . . . . .
. . . . . . .
id
= 1 is an example of somebody who "finished" their path through the website (Action A-C, then left the website on action D). However, in some instances users take a much more complex path or don't exit the website properly and the Exit Website
page isn't recorded (such as id
= 2). I'm trying to fill in those values by sorting by time-stamp and making sure that every individual's session has an "Exit Webpage" on the last row in either column A-D. If there is no "Exit Webpage" in a session (session is the id
), I want to put "Exit Webpage" on the first "Na" or "None" column observed and keep the rest. However if there is a an "Exit Webpage", I'd want to leave it unchanged.
Example:
Before -
eventType timestamp Step_A Step_B Step_C Step_D
id id_2
1 abc 5 t_1 Action_A NA NA NA
abc 5 t_2 Action_A Action_B NA NA
abc 5 t_3 Action_A Action_B Action_C NA
abc 5 t_4 Action_A Action_B Action_C Exit Website
2 ijk 5 t_4 Action_A NA NA NA
ijk 5 t_5 Action_A Action_B NA NA
ijk 5 t_6 Action_A NA NA NA
3 . . . . . . .
. . . . . . .
After -
eventType timestamp Step_A Step_B Step_C Step_D
id id_2
1 abc 5 t_1 Action_A
abc 5 t_2 Action_A Action_B
abc 5 t_3 Action_A Action_B Action_C
abc 5 t_4 Action_A Action_B Action_C Exit Website
2 ijk 5 t_5 Action_A NA NA NA
ijk 5 t_6 Action_A Action_B NA NA
ijk 5 t_7 Action_A NA NA NA
ijk 5 t_8 Action_A Exit Website NA NA
3 . . . . . . .
. . . . . . .
psuedo-code:
for the last row for every `id` in my multi-index df:
if column A or B or C or D have an NA:
df[Column] = replace first na with "Exit Webpage" for that row
else:
pass
CodePudding user response:
df.reset_index(inplace=True)
m1 = ~df.duplicated(['id', 'eventType'], keep='last')
m2 = df.duplicated(['id', 'eventType'], keep=False)
last_group_row = m1 & m2
step_a_empty = df.groupby(['id', 'eventType'])['Step_A'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)
step_b_empty = df.groupby(['id', 'eventType'])['Step_B'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)
step_c_empty = df.groupby(['id', 'eventType'])['Step_C'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)#.apply(lambda x: x.isnull().all())
step_d_empty = df.groupby(['id', 'eventType'])['Step_D'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)#.apply(lambda x: x.isnull().all())
update_a = step_b_empty & step_c_empty & step_d_empty
update_a_last_row = last_group_row & update_a
df.loc[update_a_last_row, 'Step_A'] = 'Exit Website'
update_b = df.groupby(['id', 'eventType'])['Step_A'].transform(lambda x: not 'Exit Website' in x.tolist()) #& step_c_empty & step_d_empty
update_b_last_row = last_group_row & update_b & step_c_empty & step_d_empty
df.loc[update_b_last_row, 'Step_B'] = 'Exit Website'
update_c = df.groupby(['id', 'eventType'])[['Step_A', 'Step_B']].transform(lambda x: not 'Exit Website' in x.tolist()) #& step_c_empty & step_d_empty
update_c = update_c.all(axis='columns')
update_c_last_row = last_group_row & update_c & step_d_empty
df.loc[update_c_last_row, 'Step_C'] = 'Exit Website'
update_d = df.groupby(['id', 'eventType'])[['Step_A', 'Step_B', 'Step_C', 'Step_D']].transform(lambda x: not 'Exit Website' in x.tolist()) #& step_c_empty & step_d_empty
update_d = update_d.all(axis='columns')
update_d_last_row = last_group_row & update_d
df.loc[update_d_last_row, 'Step_D'] = 'Exit Website'
df.set_index(['id', 'eventType'], inplace=True)
Original DataFrame:
timestamp Step_A Step_B Step_C Step_D
id eventType
1 5 t_1 Action_A NaN NaN NaN
5 t_2 Action_A Action_B NaN NaN
5 t_3 Action_A Action_B Action_C NaN
5 t_4 Action_A Action_B Action_C Exit Website
2 5 t_5 Action_A NaN NaN NaN
5 t_6 Action_A NaN NaN NaN
5 t_7 NaN NaN NaN NaN
3 5 t_8 Action_A Action_B NaN NaN
5 t_9 Action_A Action_B NaN NaN
5 t_10 Action_A NaN NaN NaN
4 5 t_11 Action_A Action_B Action_C NaN
5 t_12 Action_A Action_B Action_C NaN
5 t_13 Action_A Action_B NaN NaN
Final DataFrame:
timestamp Step_A Step_B Step_C Step_D
id eventType
1 5 t_1 Action_A NaN NaN NaN
5 t_2 Action_A Action_B NaN NaN
5 t_3 Action_A Action_B Action_C NaN
5 t_4 Action_A Action_B Action_C Exit Website
2 5 t_5 Action_A NaN NaN NaN
5 t_6 Action_A NaN NaN NaN
5 t_7 Exit Website NaN NaN NaN
3 5 t_8 Action_A Action_B NaN NaN
5 t_9 Action_A Action_B NaN NaN
5 t_10 Action_A Exit Website NaN NaN
4 5 t_11 Action_A Action_B Action_C NaN
5 t_12 Action_A Action_B Action_C NaN
5 t_13 Action_A Action_B Exit Website NaN
SOLUTION 2:
df.reset_index(inplace=True)
dfs = []
step_cols = [col for col in list(df) if col.startswith('Step_')]
print(step_cols)
for group_name, df_group in df.groupby(['id', 'eventType']):
print('='*50, group_name, '='*50)
if 'Exit Website' not in df_group.values:
print(df_group)
for col in reversed(step_cols):
if df_group[col].isnull().all():
print(f'Col: {col} is all null')
else:
print(f'Col: {col} contain values')
df_group[col].iloc[-1] = 'Exit Website'
break
dfs.append(df_group)
print('---')
print(df_group)
else:
dfs.append(df_group)
df_final = pd.concat(dfs, ignore_index=True)
print('================================================')
df_final.set_index(['id', 'eventType'], inplace=True)
print(df_final)