I am trying to use regex to grab the text between sections of a document that have numbered headers. The document has a table of contents, and the section headers use periods in their numbers, e.g. 1. Introduction, 1.1 Something, 1.1.1 Something Else. I'm able to parse the TOC just fine and get just the section numbers (1.1, 1.1.1, etc.), but I'm failing to parse the text of the document between two of those numbers.
Consider the following (given the document text is just one big string):
1.1 Introduction
There are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.
1.1.1 Something Else
This is where we talk about something else in life.
...
5.1.1 Conclusion
To get the text between two headers (1.1 and 1.1.1, for example), I have tried the following and a few variations of it, and I seem to be stuck.
(?s)1\.0(.*)1\.1
This works if the only things in the document are sections 1.0 and 1.1, but I don't have that luxury. Any help is greatly appreciated.
CodePudding user response:
You might use 2 capture groups and a negative lookahead to match all lines not starting with the digits and dot:
^\d+(?:\.\d+)+\b(.*)((?:\n(?!\d+\.\d).*)*)
The pattern matches:
^
Start of string (start of each line when used with re.M)
\d+(?:\.\d+)+
Match 1+ digits, and repeat 1+ times a . and 1+ digits
\b
A word boundary
(.*)
Capture group 1, match the rest of the line
(
Capture group 2
(?:\n(?!\d+\.\d).*)*
Match a newline and the rest of the line if it does not start with digits and a dot
)
Close group 2
Example
import re
pattern = r"^\d+(?:\.\d+)+\b(.*)((?:\n(?!\d+\.\d).*)*)"
s = ("1.1 Introduction\n"
"There are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.\n"
"1.1.1 Something Else\n"
"This is where we talk about something else in life.\n"
"...\n"
"5.1.1 Conclusion")
print(re.findall(pattern, s, re.M))
Output
[(' Introduction', '\nThere are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.'), (' Something Else', '\nThis is where we talk about something else in life.\n...'), (' Conclusion', '')]
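If the section number itself is needed as a lookup key, one option is to wrap the number part in an extra capture group. The following is only a sketch under that assumption: the `sections` dictionary, its `title`/`body` keys, and the shortened sample text are illustrative names, not part of the pattern above.

```python
import re

# Same pattern as above, with an extra group around the section number
# (an assumption added for illustration).
pattern = re.compile(r"^(\d+(?:\.\d+)+)\b(.*)((?:\n(?!\d+\.\d).*)*)", re.M)

s = ("1.1 Introduction\n"
     "First body line.\n"
     "Second body line.\n"
     "1.1.1 Something Else\n"
     "Another body.\n"
     "5.1.1 Conclusion")

# Map each section number to its header title and body text.
sections = {
    m.group(1): {"title": m.group(2).strip(), "body": m.group(3).strip()}
    for m in pattern.finditer(s)
}

print(sections["1.1"]["body"])
```

Note that the number part requires at least one ".digits" group, so a top-level header written as "1. Introduction" would not be matched by this pattern.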
CodePudding user response:
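Use re.split to split on the section numbers using a regex like the one below. Pass re.MULTILINE so that ^ matches at the start of every line rather than only at the start of the string.
^\d+(?:\.\d+)*
This matches one or more digits (\d+) followed by zero or more occurrences of the subpattern "period followed by one or more digits" ((?:\.\d+)*).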
The items of the resulting list are then the text between the numbers including the text on the header line itself.
If you need the section numbers too, use a capturing pattern in the regex (add parentheses around what you already have). The list will then have both the section numbers and the text between them. Counting from index 0, even-numbered items are the text between, and odd-numbered items are the section numbers. Item 0 is whatever precedes the first section number, which is an empty string if the text starts with a header.
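A minimal sketch of the capturing variant, assuming a shortened version of the sample text from the question; the names `parts` and `sections` are illustrative only.

```python
import re

text = (
    "1.1 Introduction\n"
    "There are some sentences in here.\n"
    "1.1.1 Something Else\n"
    "This is where we talk about something else in life.\n"
    "5.1.1 Conclusion"
)

# The capturing group keeps the section numbers in the result list.
parts = re.split(r"^(\d+(?:\.\d+)*)", text, flags=re.MULTILINE)

# parts[0] is the text before the first number; after that, numbers and
# text alternate, so pair odd-indexed items with the even-indexed ones.
sections = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))

for number, body in sections.items():
    print(number, repr(body))
```

One caveat with this sketch: any body line that happens to start with a digit would also be treated as a header, so the pattern may need tightening for real documents.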
CodePudding user response:
I'm not completely sure how it's being used in your Python code, but here's a regex that might help:
/([\d\.]+)/g
or in Python:
import re
matches = re.findall(r"([\d\.]+)", your_string)
As an explanation:
\d
means any numeral char (0-9)
\.
means a literal .
[<multiple_things>]
means any one of <multiple_things>
So the regex matches a digit or period one or more times in a row, as long as there is nothing in between them.
# Some examples it would match:
1
.
1.1
1.1.1
11.1.111
1.11111111.111111
1.1.
.1
1....
....
1111
# Examples it would NOT match as a single run (the space or letter splits it):
1 .1
1a.2
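Because a lone period also matches, running this on real text picks up sentence-ending periods too. The sketch below shows that on a shortened, made-up sample, next to a stricter variant anchored at the start of each line; the stricter pattern and the variable names are assumptions for illustration, not part of the regex above.

```python
import re

sample = "1.1 Introduction\nSome text.\n1.1.1 Something Else"

# The character class alone also picks up the sentence-ending period.
loose = re.findall(r"[\d.]+", sample)

# Hypothetical stricter variant: anchor at line start and require digit groups.
strict = re.findall(r"^\d+(?:\.\d+)*", sample, flags=re.MULTILINE)

print(loose)   # ['1.1', '.', '1.1.1']
print(strict)  # ['1.1', '1.1.1']
```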
CodePudding user response:
import re

text='''1.1 Introduction
There are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.
1.1.1 Something Else
This is where we talk about something else in life.
...
5.1.1 Conclusion'''
# Match runs of characters that are neither digits nor periods. This skips the
# section numbers, but it also drops every digit and period in the body text.
for e in re.findall(r'[^\d.]+', text):
    e = e.strip()  # a run can start with a space or consist only of a newline
    if e:
        print(e)
Output
Introduction
There are some sentences in here that I want and I want to do other things with them
There could be hundreds of sentences, who cares
Something Else
This is where we talk about something else in life
Conclusion