Home > Blockchain >  Create column that groups by and analyze window of values from another column
Create column that groups by and analyze window of values from another column

Time:11-20

Need to group by ColA, and use a 2/3 day window in ColD to analyze those values to create 'NewCol'. Example and more information below:

 ColA     ColB         ColC     ColD
  B      2021-10-24    2         NA
  B      2021-10-25    4          2
  B      2021-10-26    500       496
  B      2021-10-27    100      -400
  B      2021-10-28    55       -45
  B      2021-10-29    600       545
  B      2021-10-30    8        -592 
  B      2021-10-31    4300     4292 
  B      2021-11-01    200     -4100
  H      2021-10-24    600       NA
  H      2021-10-25   10000     9400 
  H      2021-10-26    100     -9900 
  H      2021-10-27    300       200
  H      2021-10-28    2       -292
  H      2021-10-29    8         6
  H      2021-10-30    600      592
  H      2021-10-31    600      0
  H      2021-11-01    650      50

Goal is to create 'NewCol', which essentially looks at ColD and associated values in it's group (ColA) from 2&3 days ago - if the value is 200 or greater, flag it with 1. If ColD value < 200 then assign 0.

Example - group B on 2021-10-31 - will look at ColD values from 2&3 days ago - thus 545 and -45, respectively - since one of those values is >200, assign NewCol with 1.

 ColA     ColB         ColC     ColD     NewCol
  B      2021-10-24    2         NA        NA
  B      2021-10-25    4          2        NA
  B      2021-10-26    500       496       NA 
  B      2021-10-27    100      -400        0
  B      2021-10-28    55       -45         1
  B      2021-10-29    600       545        1
  B      2021-10-30    8        -592        0
  B      2021-10-31    4300     4292        1
  B      2021-11-01    200     -4100        1
  H      2021-10-24    600       NA        NA
  H      2021-10-25   10000     9400       NA
  H      2021-10-26    100     -9900       NA
  H      2021-10-27    300       200        1
  H      2021-10-28    2       -292         1
  H      2021-10-29    8         6          1
  H      2021-10-30    600      592         1
  H      2021-10-31    600      0           0
  H      2021-11-01    650      50          1
  

Any suggestions is appreciated!

CodePudding user response:

With all the dates being consecutive and ordered per group, you can first compare the value in ColD to 200, and groupby the ColA. then use this grouped object with shift twice. Check if any (with |) of the two shifted values is True and cast to int

gr = df['ColD'].ge(200).groupby(df['ColA'])
df['newCol'] = (gr.shift(2)|gr.shift(3)).astype(int)
print(df)
#    ColA       ColB   ColC    ColD  newCol
# 0     B 2021-10-24      2     NaN       0
# 1     B 2021-10-25      4     2.0       0
# 2     B 2021-10-26    500   496.0       0
# 3     B 2021-10-27    100  -400.0       0
# 4     B 2021-10-28     55   -45.0       1
# 5     B 2021-10-29    600   545.0       1
# 6     B 2021-10-30      8  -592.0       0
# 7     B 2021-10-31   4300  4292.0       1
# 8     B 2021-11-01    200 -4100.0       1
# 9     H 2021-10-24    600     NaN       0
# 10    H 2021-10-25  10000  9400.0       0
# 11    H 2021-10-26    100 -9900.0       0
# 12    H 2021-10-27    300   200.0       1
# 13    H 2021-10-28      2  -292.0       1
# 14    H 2021-10-29      8     6.0       1
# 15    H 2021-10-30    600   592.0       1
# 16    H 2021-10-31    600     0.0       0
# 17    H 2021-11-01    650    50.0       1

CodePudding user response:

txt=""" ColA,ColB,ColC,ColD
B,2021-10-24,2,NA
B,2021-10-25,4,2
B,2021-10-26,500,496
B,2021-10-27,100,-400
B,2021-10-28,55,-45
B,2021-10-29,600,545
B,2021-10-30,8,-592 
B,2021-10-31,4300,4292 
B,2021-11-01,200,-4100
H,2021-10-24,600,NA
H,2021-10-25,0000,9400 
H,2021-10-26,100,-9900 
H,2021-10-27,300,200
H,2021-10-28,2,-292
H,2021-10-29,8,6
H,2021-10-30,600,592
H,2021-10-31,600,0
H,2021-11-01,650,50"""

df = pd.read_csv(io.StringIO(txt),sep=',',parse_dates=['ColB'])
df['ColD_2']=df['ColD'].shift(2)
df['ColD_3']=df['ColD'].shift(3)
df['ColD_2_check']=np.where(df['ColD_2']>200,1,0)
df['ColD_3_check']=np.where(df['ColD_3']>200,1,0)
df['newCol']=df['ColD_2_check']|df['ColD_3_check']
df.drop(['ColD_2','ColD_3','ColD_2_check','ColD_3_check'],inplace=True,axis=1)
print(df)

output

ColA       ColB  ColC    ColD  newCol
0      B 2021-10-24     2     NaN       0
1      B 2021-10-25     4     2.0       0
2      B 2021-10-26   500   496.0       0
3      B 2021-10-27   100  -400.0       0
4      B 2021-10-28    55   -45.0       1
5      B 2021-10-29   600   545.0       1
6      B 2021-10-30     8  -592.0       0
7      B 2021-10-31  4300  4292.0       1
8      B 2021-11-01   200 -4100.0       1
9      H 2021-10-24   600     NaN       1
10     H 2021-10-25     0  9400.0       1
11     H 2021-10-26   100 -9900.0       0
12     H 2021-10-27   300   200.0       1
13     H 2021-10-28     2  -292.0       1
14     H 2021-10-29     8     6.0       0
15     H 2021-10-30   600   592.0       0
16     H 2021-10-31   600     0.0       0
17     H 2021-11-01   650    50.0       1

  • Related