Suppose we have the following already sorted dataset:
ID Dead
01 F
01 F
01 T
01 T
01 T
02 F
02 F
02 F
02 F
02 T
03 T
03 T
03 T
03 T
03 T
We have 3 IDs (01, 02, and 03) and whether the individual is dead (True or False). I want the indices where the individuals are alive and the initial row when the individual died, which would leave me with the following dataset:
ID Dead
0 01 F
1 01 F
2 01 T
5 02 F
6 02 F
7 02 F
8 02 F
9 02 T
10 03 T
I came up with a solution that involves looping over all rows and appending the ID to a list if they have died previously. Is there a quicker approach?
Edit: It also has to be in order. Data is not "perfect", for example, we might have the following dataset:
ID Dead
04 F
04 T
04 F
04 F
04 F
And the desired output is:
ID Dead
04 F
04 T
CodePudding user response:
IIUC you want "Dead" equal "F" OR not duplicated on "ID Dead"
You can use boolean indexing:
m1 = df['Dead'].eq('F')
m2 = ~df.duplicated(['ID', 'Dead'])
df[m1|m2] # keep if either mask is True
output:
ID Dead
0 1 F
1 1 F
2 1 T
5 2 F
6 2 F
7 2 F
8 2 F
9 2 T
10 3 T
CodePudding user response:
You can try with groupby
with transform
idxmax
out = df[df.index<=df['Dead'].eq('T').groupby(df['ID']).transform('idxmax')]
Out[545]:
ID Dead
0 1 F
1 1 F
2 1 T
5 2 F
6 2 F
7 2 F
8 2 F
9 2 T
10 3 T
Or
out = df[df['Dead'].eq('T').groupby(df['ID']).cumsum()<=1]
Out[546]:
ID Dead
0 1 F
1 1 F
2 1 T
5 2 F
6 2 F
7 2 F
8 2 F
9 2 T
10 3 T
For update example
out = df[df.index<=df['Dead'].eq('T').groupby(df['ID']).transform('idxmax')]
out
Out[552]:
ID Dead
0 1 F
1 1 F
2 1 T
5 2 F
6 2 F
7 2 F
8 2 F
9 2 T
10 3 T
15 4 F
16 4 T