Iterate through a multi-index DataFrame and Edit certain rows-CodePudding

Context:

So I have a multi-index df that contains data on how users have traversed through our website. The df is a as follows:

df = df.loc[df['eventType'] == 5]
df.set_index(['id_1','id_2']).sort_values('timestamp').head()

sample df:

            eventType timestamp   Step_A     Step_B       Step_C      Step_D
id id_2

1   abc      5        t_1      Action_A     NA              NA            NA
    abc      5        t_2      Action_A   Action_B          NA            NA
    abc      5        t_3      Action_A   Action_B       Action_C         NA
    abc      5        t_4      Action_A   Action_B       Action_C     Exit Website


2   ijk     5        t_4      Action_A      NA              NA            NA
    ijk     5        t_5      Action_A   Action_B           NA            NA
    ijk     5        t_6      Action_A      NA              NA            NA

3    .      .         .         .           .           .           .
     .      .         .         .           .           .           .

id = 1 is an example of somebody who "finished" their path through the website (Action A-C, then left the website on action D). However, in some instances users take a much more complex path or don't exit the website properly and the Exit Website page isn't recorded (such as id = 2). I'm trying to fill in those values by sorting by time-stamp and making sure that every individual's session has an "Exit Webpage" on the last row in either column A-D. If there is no "Exit Webpage" in a session (session is the id), I want to put "Exit Webpage" on the first "Na" or "None" column observed and keep the rest. However if there is a an "Exit Webpage", I'd want to leave it unchanged.

Example:

Before -

         eventType timestamp   Step_A     Step_B       Step_C       Step_D
id id_2

1   abc      5        t_1     Action_A      NA         NA            NA
    abc      5        t_2     Action_A   Action_B      NA            NA
    abc      5        t_3     Action_A   Action_B   Action_C         NA
    abc      5        t_4     Action_A   Action_B   Action_C   Exit Website


2   ijk     5        t_4      Action_A    NA              NA            NA
    ijk     5        t_5      Action_A  Action_B          NA            NA
    ijk     5        t_6      Action_A    NA              NA            NA

3    .      .         .         .           .           .           .
     .      .         .         .           .           .           .

After -

         eventType timestamp   Step_A     Step_B       Step_C    Step_D
id id_2

1   abc      5        t_1      Action_A    
    abc      5        t_2      Action_A   Action_B
    abc      5        t_3      Action_A   Action_B   Action_C
    abc      5        t_4      Action_A   Action_B   Action_C   Exit Website



2   ijk      5        t_5      Action_A      NA         NA          NA
    ijk      5        t_6      Action_A  Action_B       NA          NA
    ijk      5        t_7      Action_A      NA         NA          NA
    ijk      5        t_8      Action_A  Exit Website   NA          NA 
 
3    .      .         .         .           .           .           .
     .      .         .         .           .           .           .

psuedo-code:

for the last row for every `id` in my multi-index df:
 
   if column A or B or C or D have an NA:

         df[Column] = replace first na with "Exit Webpage" for that row

   else:
         pass

CodePudding user response：

df.reset_index(inplace=True)
m1 = ~df.duplicated(['id', 'eventType'], keep='last')
m2 = df.duplicated(['id', 'eventType'], keep=False)
last_group_row = m1 & m2

step_a_empty = df.groupby(['id', 'eventType'])['Step_A'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)
step_b_empty = df.groupby(['id', 'eventType'])['Step_B'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)
step_c_empty = df.groupby(['id', 'eventType'])['Step_C'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)#.apply(lambda x: x.isnull().all())
step_d_empty = df.groupby(['id', 'eventType'])['Step_D'].transform(lambda x: x.isnull().all() and 'Exit Website' not in x)#.apply(lambda x: x.isnull().all())

update_a = step_b_empty & step_c_empty & step_d_empty
update_a_last_row = last_group_row & update_a
df.loc[update_a_last_row, 'Step_A'] = 'Exit Website'

update_b = df.groupby(['id', 'eventType'])['Step_A'].transform(lambda x: not 'Exit Website' in x.tolist()) #& step_c_empty & step_d_empty
update_b_last_row = last_group_row & update_b & step_c_empty & step_d_empty
df.loc[update_b_last_row, 'Step_B'] = 'Exit Website'

update_c = df.groupby(['id', 'eventType'])[['Step_A', 'Step_B']].transform(lambda x: not 'Exit Website' in x.tolist()) #& step_c_empty & step_d_empty
update_c = update_c.all(axis='columns')
update_c_last_row = last_group_row & update_c & step_d_empty
df.loc[update_c_last_row, 'Step_C'] = 'Exit Website'

update_d = df.groupby(['id', 'eventType'])[['Step_A', 'Step_B', 'Step_C', 'Step_D']].transform(lambda x: not 'Exit Website' in x.tolist()) #& step_c_empty & step_d_empty
update_d = update_d.all(axis='columns')
update_d_last_row = last_group_row & update_d
df.loc[update_d_last_row, 'Step_D'] = 'Exit Website'
df.set_index(['id', 'eventType'], inplace=True)

Original DataFrame:

               timestamp    Step_A    Step_B    Step_C        Step_D
id eventType                                                      
1  5               t_1  Action_A       NaN       NaN           NaN
   5               t_2  Action_A  Action_B       NaN           NaN
   5               t_3  Action_A  Action_B  Action_C           NaN
   5               t_4  Action_A  Action_B  Action_C  Exit Website
2  5               t_5  Action_A       NaN       NaN           NaN
   5               t_6  Action_A       NaN       NaN           NaN
   5               t_7       NaN       NaN       NaN           NaN
3  5               t_8  Action_A  Action_B       NaN           NaN
   5               t_9  Action_A  Action_B       NaN           NaN
   5              t_10  Action_A       NaN       NaN           NaN
4  5              t_11  Action_A  Action_B  Action_C           NaN
   5              t_12  Action_A  Action_B  Action_C           NaN
   5              t_13  Action_A  Action_B       NaN           NaN

Final DataFrame:

               timestamp        Step_A        Step_B        Step_C        Step_D
id eventType                                                                  
1  5               t_1      Action_A           NaN           NaN           NaN
   5               t_2      Action_A      Action_B           NaN           NaN
   5               t_3      Action_A      Action_B      Action_C           NaN
   5               t_4      Action_A      Action_B      Action_C  Exit Website
2  5               t_5      Action_A           NaN           NaN           NaN
   5               t_6      Action_A           NaN           NaN           NaN
   5               t_7  Exit Website           NaN           NaN           NaN
3  5               t_8      Action_A      Action_B           NaN           NaN
   5               t_9      Action_A      Action_B           NaN           NaN
   5              t_10      Action_A  Exit Website           NaN           NaN
4  5              t_11      Action_A      Action_B      Action_C           NaN
   5              t_12      Action_A      Action_B      Action_C           NaN
   5              t_13      Action_A      Action_B  Exit Website           NaN

SOLUTION 2:

df.reset_index(inplace=True)
dfs = []
step_cols = [col for col in list(df) if col.startswith('Step_')]
print(step_cols)


for group_name, df_group in df.groupby(['id', 'eventType']):
    print('='*50, group_name, '='*50)
    if 'Exit Website' not in df_group.values:
       print(df_group)
       for col in reversed(step_cols):
           if df_group[col].isnull().all():
               print(f'Col: {col} is all null')
           else:
               print(f'Col: {col} contain values')
               df_group[col].iloc[-1] = 'Exit Website'
               break

       dfs.append(df_group)

       print('---')
       print(df_group)
    else:
        dfs.append(df_group)

df_final = pd.concat(dfs, ignore_index=True)
print('================================================')
df_final.set_index(['id', 'eventType'], inplace=True)
print(df_final)