Home > Back-end >  Create a dataframe in pandas from the rows following a given record for each ID
Create a dataframe in pandas from the rows following a given record for each ID

Time:05-04

Given the following Python DataFrame:

ID direction other color date
0 OUT TT red 2022-01-03
0 IN TE red 2022-01-03
0 OUT YE red 2022-01-04
0 IN PO red 2022-01-05
0 IN PT red 2022-01-06
1 OUT PO blue 2022-01-04
2 IN PO green 2022-01-06
2 IN KO green 2022-01-07
3 IN XE red 2022-01-04
3 OUT PA red 2022-01-05
4 IN MY white 2022-01-01

I want to get a DataFrame from the previous one, such that it gets for each ID, all the rows following the last direction=OUT row recorded. If only rows with direction=IN exist for an ID, it will also return it in the new DataFrame, even if there is no previous direction=OUT row for that ID.

Result from the above example:

ID direction other color date
0 IN PO red 2022-01-05
0 IN PT red 2022-01-06
2 IN PO green 2022-01-06
2 IN KO green 2022-01-07
4 IN MY white 2022-01-01

Thank you in advance for your help.

CodePudding user response:

You can swap order in original DataFrame by slice [::-1], then use GroupBy.cummax with condition direction='OUT', change order by original and pass to boolean indexing with invert mask by ~:

df1 = df.iloc[::-1]
df = df[~df1['direction'].eq('OUT').groupby(df1['ID']).cummax().iloc[::-1]]
print (df)
    ID direction other  color        date
3    0        IN    PO    red  2022-01-05
4    0        IN    PT    red  2022-01-06
6    2        IN    PO  green  2022-01-06
7    2        IN    KO  green  2022-01-07
10   4        IN    MY  white  2022-01-01

How it working:

Swapping rows:

print (df.iloc[::-1])
    ID direction other  color        date
10   4        IN    MY  white  2022-01-01
9    3       OUT    PA    red  2022-01-05
8    3        IN    XE    red  2022-01-04
7    2        IN    KO  green  2022-01-07
6    2        IN    PO  green  2022-01-06
5    1       OUT    PO   blue  2022-01-04
4    0        IN    PT    red  2022-01-06
3    0        IN    PO    red  2022-01-05
2    0       OUT    YE    red  2022-01-04
1    0        IN    TE    red  2022-01-03
0    0       OUT    TT    red  2022-01-03

Get cumulative max:

print (df1['direction'].eq('OUT').groupby(df1['ID']).cummax())
10    False
9      True
8      True
7     False
6     False
5      True
4     False
3     False
2      True
1      True
0      True
Name: direction, dtype: bool

Inverse mask and swap order of values:

print (~df1['direction'].eq('OUT').groupby(df1['ID']).cummax().iloc[::-1])
0     False
1     False
2     False
3      True
4      True
5     False
6      True
7      True
8     False
9     False
10     True
Name: direction, dtype: bool

And last filter by boolean indexing.

CodePudding user response:

IIUC, you can compute a mask using a reverse cummax per group on the 'OUT' values (reversing using the [::-1] slice) and use it for boolean indexing:

mask = (df['direction'].eq('OUT')
        .groupby(df['ID']).apply(lambda s: s[::-1].cummax()[::-1])
       )

df.loc[~mask]

alternatively, get the 'OUT' values only and bfill per group to select all the rows prior to an OUT, then use the mask as previously:

s = df['direction'].eq('OUT')
mask = s.where(s).groupby(df['ID']).bfill().fillna(False)
df[~mask]

output:

    ID direction other  color        date
3    0        IN    PO    red  2022-01-05
4    0        IN    PT    red  2022-01-06
6    2        IN    PO  green  2022-01-06
7    2        IN    KO  green  2022-01-07
10   4        IN    MY  white  2022-01-01

intermediate mask:

df.assign(mask=mask)

    ID direction other  color        date   mask
0    0       OUT    TT    red  2022-01-03   True
1    0        IN    TE    red  2022-01-03   True
2    0       OUT    YE    red  2022-01-04   True
3    0        IN    PO    red  2022-01-05  False
4    0        IN    PT    red  2022-01-06  False
5    1       OUT    PO   blue  2022-01-04   True
6    2        IN    PO  green  2022-01-06  False
7    2        IN    KO  green  2022-01-07  False
8    3        IN    XE    red  2022-01-04   True
9    3       OUT    PA    red  2022-01-05   True
10   4        IN    MY  white  2022-01-01  False
  • Related