Home > Blockchain >  dataframe get rows between certain value to a certain value in previous row
dataframe get rows between certain value to a certain value in previous row

Time:10-24

I have a dataframe:

df = c1 c2 c3 code
     1. 2. 3. 200
     1. 5. 7. 220
     1. 2. 3. 200
     2. 4. 1. 340
     6. 1. 1. 370
     6. 1. 5. 270
     9. 8. 2. 300
     1. 6. 9. 700
     9. 2. 1. 200 
     8. 1. 2  400
     1. 2  1. 200
     2. 5. 3  900
     8. 0. 4. 300
     9. 1. 2. 620

I want to take only the rows that are between any row with 300 code to its previous 200 code. So here I will have

df.  c1 c2 c3 code batch_num
     1. 2. 3. 200.   0
     2. 4. 1. 340.   0
     6. 1. 1. 370.   0 
     6. 1. 5. 270.   0
     9. 8. 2. 300.   0
     1. 2  1. 200.   1
     2. 5. 3  900.   1
     8. 0. 4. 300.   1

So basically what I need is to: find each 300, and for each - find the nearest previous 200, and take the rows between them. It is guaranteed that there will always be at least one 200 before each 300. Than, add a columns that indicate the proper batch. How can I do it efficiently in pandas?

CodePudding user response:

You can use:

# rows after 200 are True
m1 = df['code'].map({200: True, 300: False}).ffill()
# rows before 300 are True
m2 = df['code'].map({300: True, 200: False}).bfill()

# if both conditions True, expand to ± 1
m = (m1&m2).rolling(3, min_periods=1, center=True).max().astype(bool)

# select
out = df[m]

# add batch
out['batch_num'] = out['code'].eq(200).cumsum().sub(1)

output:

     c1   c2   c3  code  batch_num
2   1.0  2.0  3.0   200          0
3   2.0  4.0  1.0   340          0
4   9.0  8.0  2.0   300          0
8   1.0  2.0  1.0   200          1
9   2.0  5.0  3.0   900          1
10  8.0  0.0  4.0   300          1

Intermediates:

     c1   c2   c3  code     m1     m2  m1&m2  rolling_max
0   1.0  2.0  3.0   200   True  False  False        False
1   1.0  5.0  7.0   220   True  False  False        False
2   1.0  2.0  3.0   200   True  False  False         True
3   2.0  4.0  1.0   340   True   True   True         True
4   9.0  8.0  2.0   300  False   True  False         True
5   1.0  6.0  9.0   700  False  False  False        False
6   9.0  2.0  1.0   200   True  False  False        False
7   8.0  1.0  2.0   400   True  False  False        False
8   1.0  2.0  1.0   200   True  False  False         True
9   2.0  5.0  3.0   900   True   True   True         True
10  8.0  0.0  4.0   300  False   True  False         True
11  9.0  1.0  2.0   620  False    NaN  False        False

CodePudding user response:

Idea is filter only rows matched both values by Series.isin:

#create default index
df = df.reset_index(drop=True)

v = [200, 300]

df1 = df[df['code'].isin(v)]

Then get all indices with consecutive 200, 300 pairs (by list):

m1 = df1['code'].eq(v[0]) & df1['code'].shift(-1).eq(v[1])
m2 = df1['code'].eq(v[1]) & df1['code'].shift().eq(v[0])

idx = df1.index[m1 | m2]

print (df1[m1 | m2])
     c1   c2   c3  code
2   1.0  2.0  3.0   200
6   9.0  8.0  2.0   300
10  1.0  2.0  1.0   200
12  8.0  0.0  4.0   300

Last add missing indices between pairs and select original DataFrame and batch_num column by compare by first value 200 with cumulative sum, last subtract 1:

df2 = df.loc[[x for a, b in zip(idx[::2],idx[1::2]) for x in range(a, b   1)]]

df2['batch_num'] = df2['code'].eq(v[0]).cumsum().sub(1)
print (df2)
     c1   c2   c3  code  batch_num
2   1.0  2.0  3.0   200          0
3   2.0  4.0  1.0   340          0
4   6.0  1.0  1.0   370          0
5   6.0  1.0  5.0   270          0
6   9.0  8.0  2.0   300          0
10  1.0  2.0  1.0   200          1
11  2.0  5.0  3.0   900          1
12  8.0  0.0  4.0   300          1
  • Related