Create a dataframe in pandas from the rows following a given record for each ID-CodePudding

Given the following Python DataFrame:

ID	direction	other	color	date
0	OUT	TT	red	2022-01-03
0	IN	TE	red	2022-01-03
0	OUT	YE	red	2022-01-04
0	IN	PO	red	2022-01-05
0	IN	PT	red	2022-01-06
1	OUT	PO	blue	2022-01-04
2	IN	PO	green	2022-01-06
2	IN	KO	green	2022-01-07
3	IN	XE	red	2022-01-04
3	OUT	PA	red	2022-01-05
4	IN	MY	white	2022-01-01

I want to get a DataFrame from the previous one, such that it gets for each ID, all the rows following the last direction=OUT row recorded. If only rows with direction=IN exist for an ID, it will also return it in the new DataFrame, even if there is no previous direction=OUT row for that ID.

Result from the above example:

ID	direction	other	color	date
0	IN	PO	red	2022-01-05
0	IN	PT	red	2022-01-06
2	IN	PO	green	2022-01-06
2	IN	KO	green	2022-01-07
4	IN	MY	white	2022-01-01

Thank you in advance for your help.

CodePudding user response：

You can swap order in original DataFrame by slice [::-1], then use GroupBy.cummax with condition direction='OUT', change order by original and pass to boolean indexing with invert mask by ~:

df1 = df.iloc[::-1]
df = df[~df1['direction'].eq('OUT').groupby(df1['ID']).cummax().iloc[::-1]]
print (df)
    ID direction other  color        date
3    0        IN    PO    red  2022-01-05
4    0        IN    PT    red  2022-01-06
6    2        IN    PO  green  2022-01-06
7    2        IN    KO  green  2022-01-07
10   4        IN    MY  white  2022-01-01

How it working:

Swapping rows:

print (df.iloc[::-1])
    ID direction other  color        date
10   4        IN    MY  white  2022-01-01
9    3       OUT    PA    red  2022-01-05
8    3        IN    XE    red  2022-01-04
7    2        IN    KO  green  2022-01-07
6    2        IN    PO  green  2022-01-06
5    1       OUT    PO   blue  2022-01-04
4    0        IN    PT    red  2022-01-06
3    0        IN    PO    red  2022-01-05
2    0       OUT    YE    red  2022-01-04
1    0        IN    TE    red  2022-01-03
0    0       OUT    TT    red  2022-01-03

Get cumulative max:

print (df1['direction'].eq('OUT').groupby(df1['ID']).cummax())
10    False
9      True
8      True
7     False
6     False
5      True
4     False
3     False
2      True
1      True
0      True
Name: direction, dtype: bool

Inverse mask and swap order of values:

print (~df1['direction'].eq('OUT').groupby(df1['ID']).cummax().iloc[::-1])
0     False
1     False
2     False
3      True
4      True
5     False
6      True
7      True
8     False
9     False
10     True
Name: direction, dtype: bool

And last filter by boolean indexing.

CodePudding user response：

IIUC, you can compute a mask using a reverse cummax per group on the 'OUT' values (reversing using the [::-1] slice) and use it for boolean indexing:

mask = (df['direction'].eq('OUT')
        .groupby(df['ID']).apply(lambda s: s[::-1].cummax()[::-1])
       )

df.loc[~mask]

alternatively, get the 'OUT' values only and bfill per group to select all the rows prior to an OUT, then use the mask as previously:

s = df['direction'].eq('OUT')
mask = s.where(s).groupby(df['ID']).bfill().fillna(False)
df[~mask]

output:

    ID direction other  color        date
3    0        IN    PO    red  2022-01-05
4    0        IN    PT    red  2022-01-06
6    2        IN    PO  green  2022-01-06
7    2        IN    KO  green  2022-01-07
10   4        IN    MY  white  2022-01-01

intermediate mask:

df.assign(mask=mask)

    ID direction other  color        date   mask
0    0       OUT    TT    red  2022-01-03   True
1    0        IN    TE    red  2022-01-03   True
2    0       OUT    YE    red  2022-01-04   True
3    0        IN    PO    red  2022-01-05  False
4    0        IN    PT    red  2022-01-06  False
5    1       OUT    PO   blue  2022-01-04   True
6    2        IN    PO  green  2022-01-06  False
7    2        IN    KO  green  2022-01-07  False
8    3        IN    XE    red  2022-01-04   True
9    3       OUT    PA    red  2022-01-05   True
10   4        IN    MY  white  2022-01-01  False