How can I delete a sequence of rows based on a condition?-CodePudding

I have the following dataframe:

    id outcome
0    3      no
1    3      no
2    3      no
3    3     yes
4    3      no
5    5      no
6    5      no
7    5     yes
8    5      no
9    5     yes
10   6      no
11   6      no
12   6     yes
13   6      no
14   6      no

I want to remove the no outcomes at the start of a sequence before a yes, and keep all other no outcomes, so the output dataframe looks like this:

    id outcome
3    3     yes
4    3      no
7    5     yes
8    5      no
9    5     yes
12   6     yes
13   6      no
14   6      no

At the moment I have tried this:

df = pd.DataFrame(data={
       'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 6], 
       'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no']
     })


df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]

However, this simply removes all no outcomes.

I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?

CodePudding user response：

Use groupby with cumsum to mark all 'no' at the start with a 0:

df['no_group'] = df.groupby('id')['outcome'].apply(lambda x: x.eq('yes').cumsum())

Now, the number of 'no's to remove is:

num_no_to_remove = (df['no_group'] == 0).sum()

And the wanted dataframe can be obtained by filtering:

df.loc[df['no_group'] > 0].drop(columns=['no_group'])

Result:

    id  outcome
3    3      yes
4    3       no
7    5      yes
8    5       no
9    5      yes
12   6      yes
13   6       no
14   6       no

CodePudding user response：

For keeping only last no values of each group and all the yes values, this code will do the trick:

df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]

Output:

>>> df
    id outcome
3    3     yes
4    3      no
5    3      no
8    5     yes
9    5     yes
12   6     yes

(In the original df, I added a second no the end of group 3 to make sure it works for multiple no's at the end).

Essentially what the code does is it

Replaces yes values with an arbitrary value (1) in this case
Replaces no values with NaN (which is important!)
Groups the rows by their ID
For each group, replace all NaN rows coming before the last non-NaN row with the value of the last non-NaN row. Since the yes's are 1 and the no's are NaN, this will cause everything except the last no's of the group to be replaced with the arbitrary number (1)
Creates a mask which selects all those last no values of each group
Creates a second mask which selects all yes values
Uses those two masks combined to return , and all yes values, and all no values that are at the end of a group.

For the question regarding the count of no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.