Pandas: Creating indicator column after condition

Time:04-28

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'Array':  ['S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','SS','TT'],
   'Area': [3.0, 2.0, 2.88, 1.33,  2.44, 1.25, 1.53, 1.0, 0.156, 2.0, 2.4, 6.3, 6.9, 9.78, 10.2, 3.0, 16.0, 19.0]
})
print(df)

I am trying to create an indicator column that shows whether the area has already reached a certain size. For example, if cond is A, I want to flag the first time the area is <= 1.5 and every data point after it; if cond is B, I want to flag the first time the area is > 10 and every data point after it. The final result should look like:

   cond Array    Area   Indicator
0     A     S   3.000        0
1     A     S   2.000        0
2     A    TT   2.880        0
3     A    TT   1.330        1
4     A     S   2.440        1
5     A     S   1.250        1
6     A    TT   1.530        1
7     A    TT   1.000        1
8     A     S   0.156        1
9     B     S   2.000        0
10    B    TT   2.400        0
11    B    TT   6.300        0
12    B     S   6.900        0
13    B     S   9.780        0
14    B    TT  10.200        1
15    B    TT   3.000        1
16    B    SS  16.000        1
17    B    TT  19.000        1

Most of the other examples I looked at either indicate whether the area for A is <= 1.5 or mark only the first time that happens, but none of them mark the first occurrence AND all the data points after it. The idea is that once a condition hits a certain area, it enters a different "phase", and I'm trying to indicate when "A" enters and stays in that phase (and the equivalent for B).

CodePudding user response:

You can write out the conditions as a boolean mask, then group by cond and combine cumsum with clip:

mask = (df['cond'].eq('A') & df['Area'].le(1.5)) | (df['cond'].eq('B') & df['Area'].gt(10))
df['Indicator'] = mask.groupby(df['cond']).cumsum().clip(0, 1)

Output:

>>> df
   cond Array    Area  Indicator
0   A    S     3.000   0        
1   A    S     2.000   0        
2   A    TT    2.880   0        
3   A    TT    1.330   1        
4   A    S     2.440   1        
5   A    S     1.250   1        
6   A    TT    1.530   1        
7   A    TT    1.000   1        
8   A    S     0.156   1        
9   B    S     2.000   0        
10  B    TT    2.400   0        
11  B    TT    6.300   0        
12  B    S     6.900   0        
13  B    S     9.780   0        
14  B    TT    10.200  1        
15  B    TT    3.000   1        
16  B    SS    16.000  1        
17  B    TT    19.000  1
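
To see why cumsum followed by clip works, here is a minimal self-contained sketch on a shortened, made-up frame. The names hit and running_hits are illustrative and do not come from the answer above; the sketch uses le(1.5) for A to match the <= 1.5 in the question.

```python
import pandas as pd

# Shortened, made-up sample with the same two conds as the question.
df = pd.DataFrame({
    "cond": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Area": [3.0, 1.33, 2.44, 1.0, 2.0, 10.2, 3.0, 19.0],
})

# Row-wise test: does this row meet the threshold for its own cond?
hit = (df["cond"].eq("A") & df["Area"].le(1.5)) | (df["cond"].eq("B") & df["Area"].gt(10))

# Running count of hits within each cond; it stays >= 1 after the first hit.
running_hits = hit.astype(int).groupby(df["cond"]).cumsum()

# Cap the count at 1 to turn it into a "has it happened yet" flag.
df["Indicator"] = running_hits.clip(upper=1)
print(df)
```

The running count keeps growing with every later hit (here it reaches 2 in each group), which is why the clip step is needed to get a 0/1 column.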

CodePudding user response:

You could create a boolean Series by comparing the Area values with the cutoff point for each cond. To do that, first map each cond to its cutoff. Since B needs a greater-than check and A needs a less-than check, negate both the cutoff and the Area values for B so that both checks run in the same direction. Note that ge makes the B check inclusive (Area >= 10), which differs from the strict > 10 in the question only when an Area is exactly 10.

Then use groupby.cummax to get the desired indicators:

mapping = {'A':1.5, 'B':-10}
area = df['Area'].mask(df['cond'].eq('B'), -df['Area'])
df['Indicator'] = df['cond'].map(mapping).ge(area).groupby(df['cond']).cummax().astype(int)

Output:

   cond Array    Area  Indicator
0     A     S   3.000          0
1     A     S   2.000          0
2     A    TT   2.880          0
3     A    TT   1.330          1
4     A     S   2.440          1
5     A     S   1.250          1
6     A    TT   1.530          1
7     A    TT   1.000          1
8     A     S   0.156          1
9     B     S   2.000          0
10    B    TT   2.400          0
11    B    TT   6.300          0
12    B     S   6.900          0
13    B     S   9.780          0
14    B    TT  10.200          1
15    B    TT   3.000          1
16    B    SS  16.000          1
17    B    TT  19.000          1
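
If the sign flip feels hard to read, one possible variation is to keep each comparison in its natural direction and look it up per cond. This is only a sketch: the rules dict and the made-up sample frame below are assumptions for illustration, not part of the answer above.

```python
import operator

import pandas as pd

# Made-up sample that includes the boundary values 1.5 (A) and 10.0 (B).
df = pd.DataFrame({
    "cond": ["A", "A", "A", "B", "B", "B", "B"],
    "Area": [3.0, 1.5, 2.0, 10.0, 10.2, 3.0, 19.0],
})

# Hypothetical rule table: cond -> (comparison function, cutoff).
rules = {"A": (operator.le, 1.5), "B": (operator.gt, 10)}

# Start with all False, then OR in each cond's own comparison on its own rows.
hit = pd.Series(False, index=df.index)
for cond, (compare, cutoff) in rules.items():
    hit = hit | (df["cond"].eq(cond) & compare(df["Area"], cutoff))

# A running maximum of 0/1 within each cond stays at 1 from the first hit onward.
df["Indicator"] = hit.astype(int).groupby(df["cond"]).cummax()
print(df)
```

With this rule table, an Area of exactly 1.5 flags A (le), while an Area of exactly 10.0 does not flag B (gt).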

CodePudding user response:

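Use expanding to check whether the current value or any earlier one matches your condition. The expanding window here runs over the whole Area column rather than within each cond, so this relies on no row of the other cond crossing the threshold earlier in the frame, which holds for the sample data:

# expanding().apply returns floats (1.0/0.0), so cast to bool before combining with &
condA = df["cond"].eq("A") & df["Area"].expanding().apply(lambda x: x.le(1.5).any()).astype(bool)
condB = df["cond"].eq("B") & df["Area"].expanding().apply(lambda x: x.gt(10).any()).astype(bool)
df["Indicator"] = (condA | condB).astype(int)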

>>> df
   cond Array    Area  Indicator
0     A     S   3.000          0
1     A     S   2.000          0
2     A    TT   2.880          0
3     A    TT   1.330          1
4     A     S   2.440          1
5     A     S   1.250          1
6     A    TT   1.530          1
7     A    TT   1.000          1
8     A     S   0.156          1
9     B     S   2.000          0
10    B    TT   2.400          0
11    B    TT   6.300          0
12    B     S   6.900          0
13    B     S   9.780          0
14    B    TT  10.200          1
15    B    TT   3.000          1
16    B    SS  16.000          1
17    B    TT  19.000          1
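
A per-group version of the same "has any value so far matched" check can be sketched with running minima and maxima instead of expanding. This is a swapped-in technique, not the method from the answer above, and the sample frame is made up so that a B row with a small Area comes before the A rows.

```python
import pandas as pd

# Made-up sample: B comes first and starts with an Area below A's cutoff.
df = pd.DataFrame({
    "cond": ["B", "B", "B", "A", "A", "A"],
    "Area": [1.0, 10.2, 3.0, 3.0, 1.33, 2.44],
})

# Running min and max of Area, computed separately within each cond.
grouped_area = df.groupby("cond")["Area"]
running_min = grouped_area.cummin()
running_max = grouped_area.cummax()

# "Some value so far was <= 1.5" is the same as "the running min is <= 1.5",
# and "some value so far was > 10" is the same as "the running max is > 10".
cond_a = df["cond"].eq("A") & running_min.le(1.5)
cond_b = df["cond"].eq("B") & running_max.gt(10)
df["Indicator"] = (cond_a | cond_b).astype(int)
print(df)
```

Because the running values are computed within each cond, the B row with Area 1.0 does not flag the first A row, which a window over the whole column would do.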