Existing DataFrame:
Id  action  date        value
A   enter   20/12/2021  0
A   enter   20/12/2021  150
A   L-1     20/12/2021  520
A   L-2     20/12/2021  50
A   L-2     20/12/2021  550
A   L-3     20/12/2021  20
A   L-4     20/12/2021  5
A   L-5     20/12/2021  1
B   enter   25/12/2021  2
B   L-1     25/12/2021  510
B   L-2     25/12/2021  6
B   L-3     25/12/2021  3
C   enter   26/12/2021  4
C   L-1     26/12/2021  10
C   L-2     26/12/2021  20
Expected DataFrame:
Id  action  date        value
A   L-2     20/12/2021  550
A   L-3     20/12/2021  20
A   L-4     20/12/2021  5
A   L-5     20/12/2021  1
B   L-1     25/12/2021  510
B   L-2     25/12/2021  6
B   L-3     25/12/2021  3
C   enter   26/12/2021  4
C   L-1     26/12/2021  10
C   L-2     26/12/2021  20
Looking to filter out, for each Id, the rows that come before the last appearance of a value > 500. I.e. if a value is greater than 500 and no later value in the same Id is greater than 500, then only the rows from that row onward should be kept (including that row itself).
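For reference, a minimal sketch that rebuilds the sample DataFrame from the table above (column names and values are taken straight from the question; dates are kept as plain strings):

import pandas as pd

df = pd.DataFrame({
    'Id':     ['A'] * 8 + ['B'] * 4 + ['C'] * 3,
    'action': ['enter', 'enter', 'L-1', 'L-2', 'L-2', 'L-3', 'L-4', 'L-5',
               'enter', 'L-1', 'L-2', 'L-3',
               'enter', 'L-1', 'L-2'],
    'date':   ['20/12/2021'] * 8 + ['25/12/2021'] * 4 + ['26/12/2021'] * 3,
    'value':  [0, 150, 520, 50, 550, 20, 5, 1, 2, 510, 6, 3, 4, 10, 20],
})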
This code almost solves the purpose, but it does not include the row where the condition is met:
g = df.loc[::-1, 'value'].le(500).groupby(df['Id'])  # reversed mask: True where value <= 500
m1 = g.cummin()             # drop rows with >500 and everything before them
m2 = ~g.transform('all')    # drop groups without any value >500
out = df.loc[m1 & m2]
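The reason is that the reversed mask value.le(500) is already False on the row holding 550, so cummin turns False at that very row and it never makes it into m1. A small diagnostic for Id A (a sketch reusing m1 from the snippet above):

print(m1.sort_index()[df['Id'].eq('A')])
# rows 5-7 are True but row 4 (value 550) is False, so the qualifying row is lost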
CodePudding user response:
Use GroupBy.transform with Series.shift so that the last value greater than 500 is included as well:
g = df.loc[::-1, 'value'].le(500).groupby(df['Id'])
m1 = g.transform(lambda x: x.shift(fill_value=True).cummin())  # shift keeps the last >500 row True
out = df[m1]
print(out)
    Id  action  date        value
4   A   L-2     20/12/2021  550
5   A   L-3     20/12/2021  20
6   A   L-4     20/12/2021  5
7   A   L-5     20/12/2021  1
9   B   L-1     25/12/2021  510
10  B   L-2     25/12/2021  6
11  B   L-3     25/12/2021  3
12  C   enter   26/12/2021  4
13  C   L-1     26/12/2021  10
14  C   L-2     26/12/2021  20
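Why the shift helps: the reversed mask le(500) is False exactly on rows with value > 500; shifting it by one with fill_value=True delays that False by one position, so, walking each group from the bottom up, cummin stays True through the last value > 500 and only turns False above it. A quick check of the final mask against the original rows (a sketch reusing m1 from the answer):

print(pd.concat([df[['Id', 'value']], m1.sort_index().rename('keep')], axis=1))
# 'keep' is True from the last value > 500 onward in A and B, and for every row of C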
If the C group also needs to be removed because it has no value greater than 500, chain another mask:
m2 = ~g.transform('all')    # True only for groups containing at least one value > 500
out = df.loc[m1 & m2]
print(out)
    Id  action  date        value
4   A   L-2     20/12/2021  550
5   A   L-3     20/12/2021  20
6   A   L-4     20/12/2021  5
7   A   L-5     20/12/2021  1
9   B   L-1     25/12/2021  510
10  B   L-2     25/12/2021  6
11  B   L-3     25/12/2021  3