Pandas - Remove first occurrence of value if it matches certain condition-CodePudding

I am looking to remove END from 'Task' column if there is no START before it. The data can be grouped by 'Session' and if the first occurance of the 'Task' is END, then I want to replace that specific occurrence with nAn value.

	Session	Task
0	1
1	1	END
2	1
3	1	START
4	1
5	1	END
6	2
7	2	START
8	2
9	2	END
10	2
11	2
12	3
13	3	START
14	3
15	3
16	4
17	4	START
18	4
18	4
18	4	END

the DataFrame

import pandas as pd
d = {'Session':[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4],
         'Task':['', 'END', '', 'START', '', 'END', '', 'START', '', 'END', '', '', '', 'START', '', '', '', 'START', '', '', 'END']}

df = pd.DataFrame(data=d)

My initial thought was to get the first occurrence of 'Task' for each group, in a different data frame df2, and filter only rows with 'END' value, and then use index of df2 to remove the value from the original df.

Below is the expected table

	Session	Task
0	1
1	1
2	1
3	1	START
4	1
5	1	END
6	2
7	2	START
8	2
9	2	END
10	2
11	2
12	3
13	3	START
14	3
15	3
16	4
17	4	START
18	4
18	4
18	4	END

CodePudding user response：

Using Dataframe apply with a custom function to remove unmatched "END"

Code

def remove_unmatched(x):
    ' removes "END" value when not preceeded by "START" '
    preceeded = False
    result = []
    for z in x:
        if z == "START":
            preceeded = True                       # Set preceeded to True since found a start
            result.append(z)
        elif z == "END":
            result.append(z if preceeded else "")  # "END" or "" based upon whether preceede by "START"
            preceeded = False
        else:
            result.append(z)                       # value lunchanged
            
    return pd.Series(result, index = x.index)  # new series

df['Task'] = df.groupby('Session')['Task'].apply(remove_unmatched) # provides desired df

CodePudding user response：

import numpy as np

idx = df.where(df.groupby('Session').first().eq('END'))['Task'] == 'END'
df.loc[idx, 'Task'] = np.nan

How does it work?

Basically, you group per session and take the first non NaN value which is equal to END. That is, that session starts with an END. Then select all rows in the dataframe that matches that, for the Task column. It's necessary to add == 'END' such that there's a boolean column for indexing.

Then you can just locate those cases in the Task column and assign NaN to them.