How to remove all entries of a specific ID after a binary variable becomes true in Pandas?-CodePudding

Suppose we have the following already sorted dataset:

We have 3 IDs (01, 02, and 03) and whether the individual is dead (True or False). I want the indices where the individuals are alive and the initial row when the individual died, which would leave me with the following dataset:

    ID Dead
 0  01    F
 1  01    F
 2  01    T
 5  02    F 
 6  02    F
 7  02    F
 8  02    F
 9  02    T
10  03    T

I came up with a solution that involves looping over all rows and appending the ID to a list if they have died previously. Is there a quicker approach?

Edit: It also has to be in order. Data is not "perfect", for example, we might have the following dataset:

ID Dead
04    F
04    T
04    F
04    F
04    F

And the desired output is:

ID Dead
04    F
04    T

CodePudding user response：

IIUC you want "Dead" equal "F" OR not duplicated on "ID Dead"

You can use boolean indexing:

m1 = df['Dead'].eq('F')
m2 = ~df.duplicated(['ID', 'Dead'])

df[m1|m2]  # keep if either mask is True

output:

    ID Dead
0    1    F
1    1    F
2    1    T
5    2    F
6    2    F
7    2    F
8    2    F
9    2    T
10   3    T

CodePudding user response：

You can try with groupby with transform idxmax

out = df[df.index<=df['Dead'].eq('T').groupby(df['ID']).transform('idxmax')]
Out[545]: 
    ID Dead
0    1    F
1    1    F
2    1    T
5    2    F
6    2    F
7    2    F
8    2    F
9    2    T
10   3    T

out = df[df['Dead'].eq('T').groupby(df['ID']).cumsum()<=1]
Out[546]: 
    ID Dead
0    1    F
1    1    F
2    1    T
5    2    F
6    2    F
7    2    F
8    2    F
9    2    T
10   3    T

For update example

out = df[df.index<=df['Dead'].eq('T').groupby(df['ID']).transform('idxmax')]
out
Out[552]: 
    ID Dead
0    1    F
1    1    F
2    1    T
5    2    F
6    2    F
7    2    F
8    2    F
9    2    T
10   3    T
15   4    F
16   4    T