I have a dataframe:
df = c1 c2 c3 code
1. 2. 3. 200
1. 5. 7. 220
1. 2. 3. 200
2. 4. 1. 340
6. 1. 1. 370
6. 1. 5. 270
9. 8. 2. 300
1. 6. 9. 700
9. 2. 1. 200
8. 1. 2 400
1. 2 1. 200
2. 5. 3 900
8. 0. 4. 300
9. 1. 2. 620
I want to take only the rows that are between any row with 300 code to its previous 200 code. So here I will have
df. c1 c2 c3 code batch_num
1. 2. 3. 200. 0
2. 4. 1. 340. 0
6. 1. 1. 370. 0
6. 1. 5. 270. 0
9. 8. 2. 300. 0
1. 2 1. 200. 1
2. 5. 3 900. 1
8. 0. 4. 300. 1
So basically what I need is to: find each 300, and for each - find the nearest previous 200, and take the rows between them. It is guaranteed that there will always be at least one 200 before each 300. Than, add a columns that indicate the proper batch. How can I do it efficiently in pandas?
CodePudding user response:
You can use:
# rows after 200 are True
m1 = df['code'].map({200: True, 300: False}).ffill()
# rows before 300 are True
m2 = df['code'].map({300: True, 200: False}).bfill()
# if both conditions True, expand to ± 1
m = (m1&m2).rolling(3, min_periods=1, center=True).max().astype(bool)
# select
out = df[m]
# add batch
out['batch_num'] = out['code'].eq(200).cumsum().sub(1)
output:
c1 c2 c3 code batch_num
2 1.0 2.0 3.0 200 0
3 2.0 4.0 1.0 340 0
4 9.0 8.0 2.0 300 0
8 1.0 2.0 1.0 200 1
9 2.0 5.0 3.0 900 1
10 8.0 0.0 4.0 300 1
Intermediates:
c1 c2 c3 code m1 m2 m1&m2 rolling_max
0 1.0 2.0 3.0 200 True False False False
1 1.0 5.0 7.0 220 True False False False
2 1.0 2.0 3.0 200 True False False True
3 2.0 4.0 1.0 340 True True True True
4 9.0 8.0 2.0 300 False True False True
5 1.0 6.0 9.0 700 False False False False
6 9.0 2.0 1.0 200 True False False False
7 8.0 1.0 2.0 400 True False False False
8 1.0 2.0 1.0 200 True False False True
9 2.0 5.0 3.0 900 True True True True
10 8.0 0.0 4.0 300 False True False True
11 9.0 1.0 2.0 620 False NaN False False
CodePudding user response:
Idea is filter only rows matched both values by Series.isin
:
#create default index
df = df.reset_index(drop=True)
v = [200, 300]
df1 = df[df['code'].isin(v)]
Then get all indices with consecutive 200, 300 pairs (by list):
m1 = df1['code'].eq(v[0]) & df1['code'].shift(-1).eq(v[1])
m2 = df1['code'].eq(v[1]) & df1['code'].shift().eq(v[0])
idx = df1.index[m1 | m2]
print (df1[m1 | m2])
c1 c2 c3 code
2 1.0 2.0 3.0 200
6 9.0 8.0 2.0 300
10 1.0 2.0 1.0 200
12 8.0 0.0 4.0 300
Last add missing indices between pairs and select original DataFrame and batch_num
column by compare by first value 200
with cumulative sum, last subtract 1
:
df2 = df.loc[[x for a, b in zip(idx[::2],idx[1::2]) for x in range(a, b 1)]]
df2['batch_num'] = df2['code'].eq(v[0]).cumsum().sub(1)
print (df2)
c1 c2 c3 code batch_num
2 1.0 2.0 3.0 200 0
3 2.0 4.0 1.0 340 0
4 6.0 1.0 1.0 370 0
5 6.0 1.0 5.0 270 0
6 9.0 8.0 2.0 300 0
10 1.0 2.0 1.0 200 1
11 2.0 5.0 3.0 900 1
12 8.0 0.0 4.0 300 1