Need to group by ColA, and use a 2/3 day window in ColD to analyze those values to create 'NewCol'. Example and more information below:
ColA ColB ColC ColD
B 2021-10-24 2 NA
B 2021-10-25 4 2
B 2021-10-26 500 496
B 2021-10-27 100 -400
B 2021-10-28 55 -45
B 2021-10-29 600 545
B 2021-10-30 8 -592
B 2021-10-31 4300 4292
B 2021-11-01 200 -4100
H 2021-10-24 600 NA
H 2021-10-25 10000 9400
H 2021-10-26 100 -9900
H 2021-10-27 300 200
H 2021-10-28 2 -292
H 2021-10-29 8 6
H 2021-10-30 600 592
H 2021-10-31 600 0
H 2021-11-01 650 50
Goal is to create 'NewCol', which essentially looks at ColD and associated values in it's group (ColA) from 2&3 days ago - if the value is 200 or greater, flag it with 1. If ColD value < 200 then assign 0.
Example - group B on 2021-10-31 - will look at ColD values from 2&3 days ago - thus 545 and -45, respectively - since one of those values is >200, assign NewCol with 1.
ColA ColB ColC ColD NewCol
B 2021-10-24 2 NA NA
B 2021-10-25 4 2 NA
B 2021-10-26 500 496 NA
B 2021-10-27 100 -400 0
B 2021-10-28 55 -45 1
B 2021-10-29 600 545 1
B 2021-10-30 8 -592 0
B 2021-10-31 4300 4292 1
B 2021-11-01 200 -4100 1
H 2021-10-24 600 NA NA
H 2021-10-25 10000 9400 NA
H 2021-10-26 100 -9900 NA
H 2021-10-27 300 200 1
H 2021-10-28 2 -292 1
H 2021-10-29 8 6 1
H 2021-10-30 600 592 1
H 2021-10-31 600 0 0
H 2021-11-01 650 50 1
Any suggestions is appreciated!
CodePudding user response:
With all the dates being consecutive and ordered per group, you can first compare the value in ColD to 200, and groupby
the ColA. then use this grouped object with shift
twice. Check if any (with |
) of the two shifted values is True
and cast to int
gr = df['ColD'].ge(200).groupby(df['ColA'])
df['newCol'] = (gr.shift(2)|gr.shift(3)).astype(int)
print(df)
# ColA ColB ColC ColD newCol
# 0 B 2021-10-24 2 NaN 0
# 1 B 2021-10-25 4 2.0 0
# 2 B 2021-10-26 500 496.0 0
# 3 B 2021-10-27 100 -400.0 0
# 4 B 2021-10-28 55 -45.0 1
# 5 B 2021-10-29 600 545.0 1
# 6 B 2021-10-30 8 -592.0 0
# 7 B 2021-10-31 4300 4292.0 1
# 8 B 2021-11-01 200 -4100.0 1
# 9 H 2021-10-24 600 NaN 0
# 10 H 2021-10-25 10000 9400.0 0
# 11 H 2021-10-26 100 -9900.0 0
# 12 H 2021-10-27 300 200.0 1
# 13 H 2021-10-28 2 -292.0 1
# 14 H 2021-10-29 8 6.0 1
# 15 H 2021-10-30 600 592.0 1
# 16 H 2021-10-31 600 0.0 0
# 17 H 2021-11-01 650 50.0 1
CodePudding user response:
txt=""" ColA,ColB,ColC,ColD
B,2021-10-24,2,NA
B,2021-10-25,4,2
B,2021-10-26,500,496
B,2021-10-27,100,-400
B,2021-10-28,55,-45
B,2021-10-29,600,545
B,2021-10-30,8,-592
B,2021-10-31,4300,4292
B,2021-11-01,200,-4100
H,2021-10-24,600,NA
H,2021-10-25,0000,9400
H,2021-10-26,100,-9900
H,2021-10-27,300,200
H,2021-10-28,2,-292
H,2021-10-29,8,6
H,2021-10-30,600,592
H,2021-10-31,600,0
H,2021-11-01,650,50"""
df = pd.read_csv(io.StringIO(txt),sep=',',parse_dates=['ColB'])
df['ColD_2']=df['ColD'].shift(2)
df['ColD_3']=df['ColD'].shift(3)
df['ColD_2_check']=np.where(df['ColD_2']>200,1,0)
df['ColD_3_check']=np.where(df['ColD_3']>200,1,0)
df['newCol']=df['ColD_2_check']|df['ColD_3_check']
df.drop(['ColD_2','ColD_3','ColD_2_check','ColD_3_check'],inplace=True,axis=1)
print(df)
output
ColA ColB ColC ColD newCol
0 B 2021-10-24 2 NaN 0
1 B 2021-10-25 4 2.0 0
2 B 2021-10-26 500 496.0 0
3 B 2021-10-27 100 -400.0 0
4 B 2021-10-28 55 -45.0 1
5 B 2021-10-29 600 545.0 1
6 B 2021-10-30 8 -592.0 0
7 B 2021-10-31 4300 4292.0 1
8 B 2021-11-01 200 -4100.0 1
9 H 2021-10-24 600 NaN 1
10 H 2021-10-25 0 9400.0 1
11 H 2021-10-26 100 -9900.0 0
12 H 2021-10-27 300 200.0 1
13 H 2021-10-28 2 -292.0 1
14 H 2021-10-29 8 6.0 0
15 H 2021-10-30 600 592.0 0
16 H 2021-10-31 600 0.0 0
17 H 2021-11-01 650 50.0 1