Home > Back-end >  dataframe filtering with specific condition
dataframe filtering with specific condition

Time:10-15

Existing Dataframe :

Id      action          date          value

A       enter          20/12/2021       0
A       enter          20/12/2021      150
A       L-1            20/12/2021      520
A       L-2            20/12/2021      50
A       L-2            20/12/2021      550
A       L-3            20/12/2021      20
A       L-4            20/12/2021      5
A       L-5            20/12/2021      1
B       enter          25/12/2021      2
B       L-1            25/12/2021      510
B       L-2            25/12/2021      6
B       L-3            25/12/2021      3
C       enter          26/12/2021      4
C       L-1            26/12/2021      10
C       L-2            26/12/2021      20

Expected Dataframe :

Id      action          date          value

A       L-2            20/12/2021      550
A       L-3            20/12/2021      20
A       L-4            20/12/2021      5
A       L-5            20/12/2021      1
B       L-1            25/12/2021      510
B       L-2            25/12/2021      6
B       L-3            25/12/2021      3
C       enter          26/12/2021      4
C       L-1            26/12/2021      10
C       L-2            26/12/2021      20

looking to filter out the rows before the last appearance of value > 500. i.e. if the value is greater than 500 and no other value after that is greater than 500 then only those rows need to kept that appears after the last value > 500(including this row as well)

This code solves the purpose , but does not include the row where the condition is meet

g = df.loc[::-1, 'value'].le(500).groupby(df['Id'])

m1 = g.cummin()           # drop rows with >500 and before
m2 = ~g.transform('all')  # drop groups without value >500

out = df.loc[m1&m2]

CodePudding user response:

Use GroupBy.transform with Series.shift for include values for last values greater like 500:

g = df.loc[::-1, 'value'].le(500).groupby(df['Id'])

m1 = g.transform(lambda x: x.shift(fill_value=True).cummin())  

out = df[m1]
print (out)
   Id action        date  value
4   A    L-2  20/12/2021    550
5   A    L-3  20/12/2021     20
6   A    L-4  20/12/2021      5
7   A    L-5  20/12/2021      1
9   B    L-1  25/12/2021    510
10  B    L-2  25/12/2021      6
11  B    L-3  25/12/2021      3
12  C  enter  26/12/2021      4
13  C    L-1  26/12/2021     10
14  C    L-2  26/12/2021     20

If need remove C group because no exist values greater like 500 chain another mask:

m2 = ~g.transform('all')
out = df.loc[m1&m2]
print (out)
   Id action        date  value
4   A    L-2  20/12/2021    550
5   A    L-3  20/12/2021     20
6   A    L-4  20/12/2021      5
7   A    L-5  20/12/2021      1
9   B    L-1  25/12/2021    510
10  B    L-2  25/12/2021      6
11  B    L-3  25/12/2021      3
  • Related