Given the following Python DataFrame:
ID | direction | other | color | date |
---|---|---|---|---|
0 | OUT | TT | red | 2022-01-03 |
0 | IN | TE | red | 2022-01-03 |
0 | OUT | YE | red | 2022-01-04 |
0 | IN | PO | red | 2022-01-05 |
0 | IN | PT | red | 2022-01-06 |
1 | OUT | PO | blue | 2022-01-04 |
2 | IN | PO | green | 2022-01-06 |
2 | IN | KO | green | 2022-01-07 |
3 | IN | XE | red | 2022-01-04 |
3 | OUT | PA | red | 2022-01-05 |
4 | IN | MY | white | 2022-01-01 |
I want to get a DataFrame from the previous one, such that it gets for each ID, all the rows following the last direction=OUT
row recorded. If only rows with direction=IN
exist for an ID, it will also return it in the new DataFrame, even if there is no previous direction=OUT
row for that ID.
Result from the above example:
ID | direction | other | color | date |
---|---|---|---|---|
0 | IN | PO | red | 2022-01-05 |
0 | IN | PT | red | 2022-01-06 |
2 | IN | PO | green | 2022-01-06 |
2 | IN | KO | green | 2022-01-07 |
4 | IN | MY | white | 2022-01-01 |
Thank you in advance for your help.
CodePudding user response:
You can swap order in original DataFrame
by slice [::-1]
, then use GroupBy.cummax
with condition direction='OUT'
, change order by original and pass to boolean indexing
with invert mask by ~
:
df1 = df.iloc[::-1]
df = df[~df1['direction'].eq('OUT').groupby(df1['ID']).cummax().iloc[::-1]]
print (df)
ID direction other color date
3 0 IN PO red 2022-01-05
4 0 IN PT red 2022-01-06
6 2 IN PO green 2022-01-06
7 2 IN KO green 2022-01-07
10 4 IN MY white 2022-01-01
How it working:
Swapping rows:
print (df.iloc[::-1])
ID direction other color date
10 4 IN MY white 2022-01-01
9 3 OUT PA red 2022-01-05
8 3 IN XE red 2022-01-04
7 2 IN KO green 2022-01-07
6 2 IN PO green 2022-01-06
5 1 OUT PO blue 2022-01-04
4 0 IN PT red 2022-01-06
3 0 IN PO red 2022-01-05
2 0 OUT YE red 2022-01-04
1 0 IN TE red 2022-01-03
0 0 OUT TT red 2022-01-03
Get cumulative max:
print (df1['direction'].eq('OUT').groupby(df1['ID']).cummax())
10 False
9 True
8 True
7 False
6 False
5 True
4 False
3 False
2 True
1 True
0 True
Name: direction, dtype: bool
Inverse mask and swap order of values:
print (~df1['direction'].eq('OUT').groupby(df1['ID']).cummax().iloc[::-1])
0 False
1 False
2 False
3 True
4 True
5 False
6 True
7 True
8 False
9 False
10 True
Name: direction, dtype: bool
And last filter by boolean indexing.
CodePudding user response:
IIUC, you can compute a mask using a reverse cummax
per group on the 'OUT' values (reversing using the [::-1]
slice) and use it for boolean indexing:
mask = (df['direction'].eq('OUT')
.groupby(df['ID']).apply(lambda s: s[::-1].cummax()[::-1])
)
df.loc[~mask]
alternatively, get the 'OUT' values only and bfill
per group to select all the rows prior to an OUT, then use the mask as previously:
s = df['direction'].eq('OUT')
mask = s.where(s).groupby(df['ID']).bfill().fillna(False)
df[~mask]
output:
ID direction other color date
3 0 IN PO red 2022-01-05
4 0 IN PT red 2022-01-06
6 2 IN PO green 2022-01-06
7 2 IN KO green 2022-01-07
10 4 IN MY white 2022-01-01
intermediate mask:
df.assign(mask=mask)
ID direction other color date mask
0 0 OUT TT red 2022-01-03 True
1 0 IN TE red 2022-01-03 True
2 0 OUT YE red 2022-01-04 True
3 0 IN PO red 2022-01-05 False
4 0 IN PT red 2022-01-06 False
5 1 OUT PO blue 2022-01-04 True
6 2 IN PO green 2022-01-06 False
7 2 IN KO green 2022-01-07 False
8 3 IN XE red 2022-01-04 True
9 3 OUT PA red 2022-01-05 True
10 4 IN MY white 2022-01-01 False