filter and get rows between the conditions in a dataframe

Time:09-19

My DataFrame looks something like this:


 ---------------------------------- --------- 
|               Col1               |  Col2   |
 ---------------------------------- --------- 
| Start A                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End A                            |    6    |
| value 6                          |    3    |
| value 7                          |    4    |
| value 8                          |    5    |
| Start B                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End B                            |    6    |
| value 6                          |    3    |
| value 7                          |    4    |
| value 8                          |    5    |
| Start C                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End C                            |    6    |
 ---------------------------------- --------- 

What I am trying to achieve: wherever a Start marker and its matching End marker appear as substrings in Col1, I want the rows between them, markers included. The expected result is:


 ---------------------------------- --------- 
|               Col1               |  Col2   |
 ---------------------------------- --------- 
| Start A                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End A                            |    6    |
| Start B                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End B                            |    6    |
| Start C                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End C                            |    6    |
 ---------------------------------- --------- 

I tried the code from this answer: How to filter dataframe columns between two rows that contain specific string in column?

m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]

But this returns only the first Start/End pair, and it requires an exact string match.

output:

 ---------------------------------- --------- 
|               Col1               |  Col2   |
 ---------------------------------- --------- 
| Start A                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End A                            |    6    |
 ---------------------------------- --------- 

CodePudding user response:

Let's try:

# extract the label after `Start/End`
# (?:Start|End) is a non-capturing group; [Start|End] would be a character class
groups = df['Col1'].str.extract(r'(?:Start|End) (.*)', expand=False)

# keep rows with equal forward fill and backward fill 
df[groups.bfill() == groups.ffill()]

Output:

       Col1  Col2
0   Start A     1
1   value 1     2
2   value 2     3
3   value 3     4
4   value 5     5
5     End A     6
9   Start B     1
10  value 1     2
11  value 2     3
12  value 3     4
13  value 5     5
14    End B     6
18  Start C     1
19  value 1     2
20  value 2     3
21  value 3     4
22  value 5     5
23    End C     6
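
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The loop that rebuilds the frame is my own reconstruction of the table in the question, and the regex uses a non-capturing group for the Start/End alternation; treat it as an illustration of the idea rather than the only way to write it.

```python
import pandas as pd

# Rebuild the sample frame from the question (values copied from its table).
inside = ["value 1", "value 2", "value 3", "value 5"]
outside = ["value 6", "value 7", "value 8"]
col1, col2 = [], []
for label in "ABC":
    col1 += [f"Start {label}", *inside, f"End {label}"]
    col2 += [1, 2, 3, 4, 5, 6]
    if label != "C":  # rows that sit between two blocks
        col1 += outside
        col2 += [3, 4, 5]
df = pd.DataFrame({"Col1": col1, "Col2": col2})

# Label that follows Start/End; NaN on every other row.
groups = df["Col1"].str.extract(r"(?:Start|End) (.*)", expand=False)

# Inside a block, the nearest marker above (ffill) and the nearest marker
# below (bfill) carry the same label; between blocks they differ.
out = df[groups.ffill() == groups.bfill()]
print(out)
```

The comparison works because rows between "End A" and "Start B" see "A" from above and "B" from below, so they are dropped.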

CodePudding user response:

The answer you linked to was designed to work with a single pair of Start/End.

A more generic variant of it would be to check for the parity of the group (assuming strictly alternating Start/End):

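# True from each Start up to (but not including) its End
m = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)

# boolean indexing; the shifted mask adds the End row itself
out = df[m | m.shift(fill_value=False)]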

Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. Unlike the nice answer from @Quang, however, this does not take the A/B/C letter after Start/End into account:

# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)

# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()

# boolean slicing
out = df[m1|m1.shift()]

Output:

       Col1  Col2
0   Start A     1
1   value 1     2
2   value 2     3
3   value 3     4
4   value 5     5
5     End A     6
9   Start B     1
10  value 1     2
11  value 2     3
12  value 3     4
13  value 5     5
14    End B     6
18  Start C     1
19  value 1     2
20  value 2     3
21  value 3     4
22  value 5     5
23    End C     6
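
To make the difference between the approaches concrete, here is a small sketch on a hypothetical frame (not the data from the question) in which the second block opens with "Start B" but closes with "End X". The numeric flag mapping is my own variant, chosen so the mask ends up as a plain boolean Series; the names `keep_parity`, `keep_flag` and `keep_label` are illustrative.

```python
import pandas as pd

# Hypothetical data: the second block has mismatched labels (B vs X).
df = pd.DataFrame({"Col1": ["Start A", "value 1", "End A", "value 6",
                            "Start B", "value 2", "End X", "value 7"]})

# Parity approach: an odd count of markers seen so far means "inside a block".
m = df["Col1"].str.match(r"Start|End").cumsum().mod(2).eq(1)
keep_parity = m | m.shift(fill_value=False)

# Flag approach: Start switches rows on (1), End switches them off (0).
s = df["Col1"].str.extract(r"^(Start|End)", expand=False)
m1 = s.map({"Start": 1.0, "End": 0.0}).ffill().fillna(0).astype(bool)
keep_flag = m1 | m1.shift(fill_value=False)

# Label approach: compare the label of the nearest marker above and below.
groups = df["Col1"].str.extract(r"(?:Start|End) (.*)", expand=False)
keep_label = groups.ffill() == groups.bfill()

print(df[keep_parity].index.tolist())
print(df[keep_flag].index.tolist())
print(df[keep_label].index.tolist())
```

On this frame the parity and flag masks select the same rows and keep "value 2", while the label comparison drops it because "B" and "X" do not match. The marker rows themselves still pass the label comparison, since each one compares equal to its own label.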