I am looking to remove END from 'Task' column if there is no START before it. The data can be grouped by 'Session' and if the first occurance of the 'Task' is END, then I want to replace that specific occurrence with nAn value.
Session | Task | |
---|---|---|
0 | 1 | |
1 | 1 | END |
2 | 1 | |
3 | 1 | START |
4 | 1 | |
5 | 1 | END |
6 | 2 | |
7 | 2 | START |
8 | 2 | |
9 | 2 | END |
10 | 2 | |
11 | 2 | |
12 | 3 | |
13 | 3 | START |
14 | 3 | |
15 | 3 | |
16 | 4 | |
17 | 4 | START |
18 | 4 | |
18 | 4 | |
18 | 4 | END |
the DataFrame
import pandas as pd
d = {'Session':[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4],
'Task':['', 'END', '', 'START', '', 'END', '', 'START', '', 'END', '', '', '', 'START', '', '', '', 'START', '', '', 'END']}
df = pd.DataFrame(data=d)
My initial thought was to get the first occurrence of 'Task' for each group, in a different data frame df2, and filter only rows with 'END' value, and then use index of df2 to remove the value from the original df.
Below is the expected table
Session | Task | |
---|---|---|
0 | 1 | |
1 | 1 | |
2 | 1 | |
3 | 1 | START |
4 | 1 | |
5 | 1 | END |
6 | 2 | |
7 | 2 | START |
8 | 2 | |
9 | 2 | END |
10 | 2 | |
11 | 2 | |
12 | 3 | |
13 | 3 | START |
14 | 3 | |
15 | 3 | |
16 | 4 | |
17 | 4 | START |
18 | 4 | |
18 | 4 | |
18 | 4 | END |
CodePudding user response:
Using Dataframe apply with a custom function to remove unmatched "END"
Code
def remove_unmatched(x):
' removes "END" value when not preceeded by "START" '
preceeded = False
result = []
for z in x:
if z == "START":
preceeded = True # Set preceeded to True since found a start
result.append(z)
elif z == "END":
result.append(z if preceeded else "") # "END" or "" based upon whether preceede by "START"
preceeded = False
else:
result.append(z) # value lunchanged
return pd.Series(result, index = x.index) # new series
df['Task'] = df.groupby('Session')['Task'].apply(remove_unmatched) # provides desired df
CodePudding user response:
import numpy as np
idx = df.where(df.groupby('Session').first().eq('END'))['Task'] == 'END'
df.loc[idx, 'Task'] = np.nan
How does it work?
Basically, you group per session and take the first non NaN
value which is equal to END
. That is, that session starts with an END
. Then select all rows in the dataframe that matches that, for the Task
column. It's necessary to add == 'END'
such that there's a boolean column for indexing.
Then you can just locate those cases in the Task column
and assign NaN
to them.