I have the following dataframe:
id outcome
0 3 no
1 3 no
2 3 no
3 3 yes
4 3 no
5 5 no
6 5 no
7 5 yes
8 5 no
9 5 yes
10 6 no
11 6 no
12 6 yes
13 6 no
14 6 no
I want to remove the no
outcomes at the start of a sequence before a yes
, and keep all other no
outcomes, so the output dataframe looks like this:
id outcome
3 3 yes
4 3 no
7 5 yes
8 5 no
9 5 yes
12 6 yes
13 6 no
14 6 no
At the moment I have tried this:
df = pd.DataFrame(data={
'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 6],
'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no']
})
df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]
However, this simply removes all no
outcomes.
I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?
CodePudding user response:
Use groupby
with cumsum
to mark all 'no' at the start with a 0:
df['no_group'] = df.groupby('id')['outcome'].apply(lambda x: x.eq('yes').cumsum())
Now, the number of 'no's to remove is:
num_no_to_remove = (df['no_group'] == 0).sum()
And the wanted dataframe can be obtained by filtering:
df.loc[df['no_group'] > 0].drop(columns=['no_group'])
Result:
id outcome
3 3 yes
4 3 no
7 5 yes
8 5 no
9 5 yes
12 6 yes
13 6 no
14 6 no
CodePudding user response:
For keeping only last no
values of each group and all the yes
values, this code will do the trick:
df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]
Output:
>>> df
id outcome
3 3 yes
4 3 no
5 3 no
8 5 yes
9 5 yes
12 6 yes
(In the original df
, I added a second no
the end of group 3
to make sure it works for multiple no
's at the end).
Essentially what the code does is it
- Replaces
yes
values with an arbitrary value (1
) in this case - Replaces
no
values with NaN (which is important!) - Groups the rows by their ID
- For each group, replace all NaN rows coming before the last non-NaN row with the value of the last non-NaN row. Since the
yes
's are1
and theno
's are NaN, this will cause everything except the lastno
's of the group to be replaced with the arbitrary number (1
) - Creates a mask which selects all those last
no
values of each group - Creates a second mask which selects all
yes
values - Uses those two masks combined to return , and all
yes
values, and allno
values that are at the end of a group.
For the question regarding the count of no
's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.