My DataFrame looks something like this:
---------------------------------- ---------
| Col1 | Col2 |
---------------------------------- ---------
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
---------------------------------- ---------
What I am trying to achieve: wherever a row containing the substring "Start" is followed by a row containing "End", I want to keep those two rows and everything between them, and drop the rest. The expected result is:
---------------------------------- ---------
| Col1 | Col2 |
---------------------------------- ---------
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
---------------------------------- ---------
I tried the code from this question: "How to filter dataframe columns between two rows that contain specific string in column?"
m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]
But this returns only the first Start/End block, and it also requires the exact strings rather than a substring match.
Output:
---------------------------------- ---------
| Col1 | Col2 |
---------------------------------- ---------
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
---------------------------------- ---------
CodePudding user response:
Let's try:
# extract the label after `Start`/`End` (a non-capturing group, not a character class)
groups = df['Col1'].str.extract(r'^(?:Start|End) (.*)', expand=False)
# keep rows with equal forward fill and backward fill
df[groups.bfill() == groups.ffill()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
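For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and applies the same forward-fill/backward-fill comparison. The helper names `make_sample` and `keep_labelled_blocks` are made up for illustration, and the sketch assumes every marker looks like `Start <label>` or `End <label>` at the beginning of the string.

```python
import pandas as pd


def make_sample() -> pd.DataFrame:
    """Rebuild the 24-row frame shown in the question (hypothetical helper)."""
    inner = ["value 1", "value 2", "value 3", "value 5"]
    outer = ["value 6", "value 7", "value 8"]
    col1, col2 = [], []
    for label in "ABC":
        col1 += [f"Start {label}", *inner, f"End {label}"]
        col2 += [1, 2, 3, 4, 5, 6]
        if label != "C":  # no trailing rows after the last block
            col1 += outer
            col2 += [3, 4, 5]
    return pd.DataFrame({"Col1": col1, "Col2": col2})


def keep_labelled_blocks(df: pd.DataFrame, col: str = "Col1") -> pd.DataFrame:
    """Keep rows whose nearest marker above and below carry the same label."""
    # label is "A", "B", ... on marker rows and NaN everywhere else
    labels = df[col].str.extract(r"^(?:Start|End) (.*)", expand=False)
    # inside a block both fills agree; between blocks they differ
    mask = labels.ffill() == labels.bfill()
    return df[mask]


df = make_sample()
print(keep_labelled_blocks(df))
```

Rows between `End A` and `Start B` see label `A` above and `B` below, so the comparison is false and they are dropped.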
CodePudding user response:
The answer you linked to was designed to work with a single pair of Start/End.
A more generic variant of it would be to check for the parity of the group (assuming strictly alternating Start/End):
m = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)
# boolean indexing: `m` covers each Start up to the row before End, the shift adds the End row
out = df[m|m.shift(fill_value=False)]
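To see why parity works where `.eq(1)` only catches the first pair, it can help to look at the running marker count. A small sketch on a made-up six-row frame (the data and variable names here are illustrative, not from the question):

```python
import pandas as pd

# hypothetical mini-frame: two blocks with one stray row in between
demo = pd.DataFrame(
    {"Col1": ["Start A", "value 1", "End A", "value 6", "Start B", "End B"]}
)

is_marker = demo["Col1"].str.match(r"Start|End")
count = is_marker.cumsum()       # running number of markers seen so far
first_pair_only = count.eq(1)    # true only between the 1st and 2nd marker
odd = count.mod(2).eq(1)         # true from each Start up to the row before its End
keep = odd | odd.shift(fill_value=False)  # the shift pulls in each End row

print(pd.DataFrame({"Col1": demo["Col1"], "count": count, "odd": odd, "keep": keep}))
```

The count is odd after every Start and even after every End, which is why this only holds if Start and End strictly alternate.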
Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. Unlike @Quang's nice answer, however, this does not check that the A/B/C letter after Start matches the one after End:
# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)
# set flags and ffill
m1 = s.map({'Start': True, 'End': False}).ffill().fillna(False).astype(bool)
# boolean slicing (the shift adds each End row back)
out = df[m1|m1.shift(fill_value=False)]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
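Since the question asks for a substring match rather than exact strings, here is a hedged variation of the parity idea that uses `str.contains` with `case=False`. The function name `rows_between_markers`, the default pattern, and the demo data are assumptions for illustration. Like the parity version above, it assumes markers strictly alternate and it ignores the label after Start/End.

```python
import pandas as pd


def rows_between_markers(
    df: pd.DataFrame,
    col: str = "Col1",
    pattern: str = r"\b(?:start|end)\b",  # whole words, so "restart" is not a marker
) -> pd.DataFrame:
    """Keep each start..end block, matching markers anywhere in the string."""
    is_marker = df[col].str.contains(pattern, case=False, regex=True, na=False)
    # odd running count: we are at or after a start and before its end
    inside = is_marker.cumsum().mod(2).eq(1)
    # OR-ing with is_marker adds the end rows, whose running count is even
    return df[inside | is_marker]


# made-up frame with markers embedded in longer strings and mixed case
demo = pd.DataFrame(
    {
        "Col1": ["noise", "Block start A", "value 1", "block END a", "noise",
                 "START B", "value 2", "End B", "noise"],
        "Col2": range(9),
    }
)
result = rows_between_markers(demo)
print(result)
```

Using `inside | is_marker` instead of `m | m.shift()` is a small design choice: it avoids the shifted series altogether, at the cost of also keeping any unpaired marker row.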