Ignoring irrelevant section in regex

Time:11-18

I have the following text (this is closely related to another question, but not identical):

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

I want to select only section 7, so that I get:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.

So I do:

print(re.findall(r'^\d+\.\s*A B C[^\S\n]*(?:\n\n.+)*', text, re.M)[0])

which gives:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.

8.  X Y Z

8.1 ha ha ha 

(a) hohoho ;

(b) hihihi,

8

As you can see, the match runs on into section 8, and a stray 8 even comes after 8.1. This seems to confuse the regex. What can I do in this case?

Note that the section number can differ in general, so I cannot do something like re.findall(r'^7\..*', text, re.MULTILINE) (that is, A B C can be placed in other sections).

CodePudding user response:

You can use

^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*
  • ^ Start of a line (the pattern is used with re.MULTILINE)
  • (\d+\.)\s*A B C Capture group 1 to match 1+ digits and ., then match optional whitespace and A B C
  • (?: Non capture group to match as a whole
    • \s*\n Match optional whitespace chars and a newline
    • \1\b A backreference to group 1 (the same section number and dot) followed by a word boundary
    • .* Match the rest of the line
  • )* Close the non capture group and optionally repeat it

See a regex demo.

import re

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'
pattern = r"^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*"

m = re.search(pattern, text, re.MULTILINE)
if m:
    print(m.group())

Output

7.      A B C

7.1     bla bla bla .

7.2     another bla bla 

7.3     and another one.
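
Since the heading text can differ from one document to another, the same pattern can be wrapped in a small helper. This is only a sketch: the function name extract_section and its title parameter are hypothetical, and it assumes the heading line has the form "<number>. <title>" as in the sample text.

```python
import re

def extract_section(text, title):
    """Return the numbered section whose heading is `title`, or None.

    Hypothetical helper: assumes headings look like '<number>. <title>'.
    """
    # re.escape protects against regex metacharacters in the title.
    # \1 forces every following line to start with the same '<number>.'.
    pattern = rf"^(\d+\.)[^\S\n]*{re.escape(title)}(?:\s*\n\1\b.*)*"
    m = re.search(pattern, text, re.MULTILINE)
    return m.group() if m else None

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

print(extract_section(text, "A B C"))
```

Note that this approach keeps only lines that start with the section number, so unnumbered sub-items such as (a) and (b) in section 8 would be left out.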

CodePudding user response:

Here is a potential solution to find the lines between 7. and the first 8.:

m = re.search(r'^7\.\s*(?:.|\n)+?(?=\n8\.)', text, re.MULTILINE)
print(m.group() if m else None)

output:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.
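
The pattern above hard-codes 7 and 8. If the section number is not known in advance, one possible variation is to read the number from the heading and stop at the next one. This is a rough sketch, not a definitive implementation: the function name section_until_next is hypothetical, and it assumes sections are numbered consecutively (so the section after N starts with "N+1.").

```python
import re

def section_until_next(text, title):
    """Return the section headed by `title`, up to the next section number.

    Hypothetical helper: assumes consecutive numbering of sections.
    """
    # Find the heading line and capture its number.
    head = re.search(rf"^(\d+)\.[^\S\n]*{re.escape(title)}", text, re.MULTILINE)
    if head is None:
        return None
    nxt = int(head.group(1)) + 1
    # Lazily consume text until newline(s) followed by '<nxt>.',
    # or until the end of the text if there is no next section.
    body = re.compile(rf".*?(?=\n+{nxt}\.|\Z)", re.DOTALL)
    m = body.match(text, head.end())
    return text[head.start():m.end()]

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

print(section_until_next(text, "A B C"))
```

Unlike a pattern that only accepts lines starting with the section number, this keeps unnumbered lines such as (a) and (b), but for the last section it also keeps whatever follows up to the end of the text.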
