Home > OS >  Pandas add label column based on conditions
Pandas add label column based on conditions

Time:10-15

Setup:

df_input = pd.DataFrame(data={'Session': [0, 0, 0, 0, 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                              'Action': ['Start', 'Action1', 'Action2', 'Action3', 'Pause', 'Action4',
                                         'Action5', 'Resume', 'Finish', 'Start', 'Action1', 'Action2', 'Pause',
                                         'Action3', 'Action4', 'Resume', 'Finish', 'Start', 'Action1',
                                         'Action2', 'Finish']})

df_output = pd.DataFrame(data={'Session': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2], 
                               'Action': ['Start', 'Action1', 'Action2', 'Action3', 'Pause', 'Action4', 'Action5', 
                                          'Resume', 'Finish', 'Start', 'Action1', 'Action2', 'Pause', 'Action3', 
                                          'Action4', 'Resume', 'Finish', 'Start', 'Action1', 'Action2', 'Finish'], 
                               'Label': [0, 1, 2, 3, 4, 4, 4, 4, 5, 0, 1, 2, 3, 3, 3, 3, 4, 0, 1, 2, 3]})

Input:

    Session   Action
0         0    Start
1         0  Action1
2         0  Action2
3         0  Action3
4         0    Pause
5         0  Action4
6         0  Action5
7         0   Resume
8         0   Finish
9         1    Start
10        1  Action1
11        1  Action2
12        1    Pause
13        1  Action3
14        1  Action4
15        1   Resume
16        1   Finish
17        2    Start
18        2  Action1
19        2  Action2
20        2   Finish

Output:

     Session   Action  Label
0         0    Start      0
1         0  Action1      1
2         0  Action2      2
3         0  Action3      3
4         0    Pause      4 #Same
5         0  Action4      4 #Same
6         0  Action5      4 #Same
7         0   Resume      4 #Same
8         0   Finish      5
9         1    Start      0
10        1  Action1      1
11        1  Action2      2
12        1    Pause      3 #Same
13        1  Action3      3 #Same
14        1  Action4      3 #Same
15        1   Resume      3 #Same
16        1   Finish      4
17        2    Start      0
18        2  Action1      1
19        2  Action2      2
20        2   Finish      3

Labeling Rule: In every session, label remains same from pause until resumed.

I had been trying for so long but was unable to make labels same from pause till resumed

CodePudding user response:

This only requires some logic to be written, refer -

label_list, label, paused = [], 0, 0
for index, row in df_input.iterrows():
  #print(index, row['Action'])
  if paused and not row['Action'] == 'Resume':
    label_list.append(label)
  elif row['Action'] == 'Resume':
    label_list.append(label)
    label =1
    paused=0
  else:
    if row['Action'] == 'Pause':
      #label =1
      label_list.append(label)
      paused = 1
    elif row['Action'] == 'Finish':
      #label =1
      label_list.append(label)
      label=0
    else:
      label_list.append(label)
      label =1

Output-

  Session Action Label
0   0   Start   0
1   0   Action1 1
2   0   Action2 2
3   0   Action3 3
4   0   Pause   4
5   0   Action4 4
6   0   Action5 4
7   0   Resume  4
8   0   Finish  5
9   1   Start   0
10  1   Action1 1
11  1   Action2 2
12  1   Pause   3
13  1   Action3 3
14  1   Action4 3
15  1   Resume  3
16  1   Finish  4
17  2   Start   0
18  2   Action1 1
19  2   Action2 2
20  2   Finish  3

CodePudding user response:

We can try creating a mask to determine exactly where values are paused, then we can mask out the Paused rows before taking the total within group:

m = (
    df_input['Action']
        .where(df_input['Action'].isin(['Pause', 'Resume']))
        .groupby(df_input['Session']).shift()
        .ffill().eq('Pause')
)
df_input['Label'] = (
        df_input['Action'].mask(m).notna()
        .groupby(df_input['Session']).cumsum() - 1
)
    Session   Action  Label
0         0    Start      0
1         0  Action1      1
2         0  Action2      2
3         0  Action3      3
4         0    Pause      4
5         0  Action4      4
6         0  Action5      4
7         0   Resume      4
8         0   Finish      5
9         1    Start      0
10        1  Action1      1
11        1  Action2      2
12        1    Pause      3
13        1  Action3      3
14        1  Action4      3
15        1   Resume      3
16        1   Finish      4
17        2    Start      0
18        2  Action1      1
19        2  Action2      2
20        2   Finish      3

Breakdown of steps in a DataFrame for reference:

pd.DataFrame({
    1: df_input['Action']
        .where(df_input['Action'].isin(['Pause', 'Resume'])),
    2: df_input['Action']
        .where(df_input['Action'].isin(['Pause', 'Resume']))
        .groupby(df_input['Session']).shift(),
    3: df_input['Action']
        .where(df_input['Action'].isin(['Pause', 'Resume']))
        .groupby(df_input['Session']).shift()
        .ffill(),
    4: df_input['Action']
        .where(df_input['Action'].isin(['Pause', 'Resume']))
        .groupby(df_input['Session']).shift()
        .ffill().eq('Pause'),
    5: df_input['Action'].mask(m),
    6: df_input['Action'].mask(m).notna(),
    7: df_input['Action'].mask(m).notna()
        .groupby(df_input['Session']).cumsum() - 1
}).rename_axis(columns='Steps')
Steps       1       2       3      4        5      6  7
0         NaN     NaN     NaN  False    Start   True  0
1         NaN     NaN     NaN  False  Action1   True  1
2         NaN     NaN     NaN  False  Action2   True  2
3         NaN     NaN     NaN  False  Action3   True  3
4       Pause     NaN     NaN  False    Pause   True  4
5         NaN   Pause   Pause   True      NaN  False  4
6         NaN     NaN   Pause   True      NaN  False  4
7      Resume     NaN   Pause   True      NaN  False  4
8         NaN  Resume  Resume  False   Finish   True  5
9         NaN     NaN  Resume  False    Start   True  0
10        NaN     NaN  Resume  False  Action1   True  1
11        NaN     NaN  Resume  False  Action2   True  2
12      Pause     NaN  Resume  False    Pause   True  3
13        NaN   Pause   Pause   True      NaN  False  3
14        NaN     NaN   Pause   True      NaN  False  3
15     Resume     NaN   Pause   True      NaN  False  3
16        NaN  Resume  Resume  False   Finish   True  4
17        NaN     NaN  Resume  False    Start   True  0
18        NaN     NaN  Resume  False  Action1   True  1
19        NaN     NaN  Resume  False  Action2   True  2
20        NaN     NaN  Resume  False   Finish   True  3

Docs for Methods Used:

  1. Series.where
  2. Series.isin
  3. Groupby.shift
  4. Series.ffill
  5. Series.eq
  6. Series.mask
  7. Series.notna
  8. Groupby.cumsum
  • Related