Multiline regex in PDF file

I want to extract some information from a set of PDF files. I only need the information on page 2 and later, which looks like this:

[number]. (U) country: On [date] [text]. (text in brackets)

This means each entry always starts with a number, a dot, (U) and a country, and ends with text in brackets, which may wrap onto the next line.

My implementation in Python is the following:

Use pdfminer's extract_text function to get the whole text.
Then call re.findall on the whole text with the regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ and the re.MULTILINE option.

I have noticed that this extracts the first line of every paragraph I am interested in, but I cannot find a way to capture everything up to the end of the paragraph, which is the text in brackets, (.*).
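The first-line-only behaviour can be reproduced on a small made-up sample. This is a sketch, not the real data: the sample text and country names are invented, the `+` quantifiers in the pattern are an assumption about what the question's regex originally contained, and re.IGNORECASE is added so that (u) and on match the capitalised sample. The dot never matches a newline by default, and with re.MULTILINE the `$` anchor matches at the end of every line, so each match stops at the first line break.

```python
import re

# Hypothetical sample mimicking the layout described in the question.
sample = (
    "1. (U) Freedonia: On 12 March something happened\n"
    "and the report continues here. (Source\n"
    "details in brackets)\n"
    "\n"
    "2. (U) Sylvania: On 3 April another event. (Short note)\n"
)

# The question's pattern, with the literal parentheses escaped
# (the + quantifiers are assumed).
first_line_only = r"^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$"

matches = re.findall(first_line_only, sample, re.MULTILINE | re.IGNORECASE)
for m in matches:
    # Each match ends at the first line break of its paragraph.
    print(repr(m))
```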

I was wondering if anyone could help with this. I was hoping to match everything with a single regex. Otherwise I might try splitting the text by line and iterating through each one.

Thanks in advance.

CodePudding user response：

You could update the pattern to use a negated character class that matches up to the first occurrence of : and then requires at least the word on after it.

To match all the following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.
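The lookahead idea can be tried in isolation. The snippet below is a minimal sketch on made-up text: the match keeps absorbing "newline plus next line" until the next line is empty or whitespace-only.

```python
import re

# Hypothetical text: two paragraphs separated by a whitespace-only line.
text = "first line\nsecond line\n   \nnext paragraph\n"

# ^first.* takes the first line. Each repetition of the group consumes a
# newline plus the following line, unless that line is blank.
paragraph = re.compile(r"^first.*(?:\n(?![^\S\r\n]*\n).*)*", re.MULTILINE)

match = paragraph.search(text)
print(repr(match.group()))  # 'first line\nsecond line'
```

The class [^\S\r\n] reads as "whitespace, but not a line break", so the lookahead only rejects lines that are empty or made of spaces and tabs.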

Using a case insensitive match:

^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*

The pattern matches:

^ Start of a line (because of re.MULTILINE)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, . a whitespace char, (u) and another whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
- \n(?![^\S\r\n]*\n).* Match a newline, assert that the next line is not blank (only spaces followed by a newline), then match that whole line
)* Close non capture group and optionally repeat
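The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch: the names paragraph_re, sample and found are illustrative and the sample text is invented.

```python
import re

paragraph_re = re.compile(
    r"""
    ^\d{1,2}\.\s\(u\)\s      # 1-2 digits, a dot, whitespace, (u), whitespace
    [^:\n]*:                 # country name up to the first colon on this line
    .*?\son\s                # first occurrence of on between whitespace chars
    \d{1,2}\s                # 1-2 digits and a whitespace char
    .*                       # rest of the first line
    (?:                      # then, zero or more times:
        \n(?![^\S\r\n]*\n)   #   a newline not followed by a blank line
        .*                   #   and the whole next line
    )*
    """,
    re.MULTILINE | re.IGNORECASE | re.VERBOSE,
)

# Made-up sample; the second paragraph ends the text without a trailing newline.
sample = (
    "1. (U) Freedonia: On 12 March something happened\n"
    "and the report continues here. (Source\n"
    "details in brackets)\n"
    "\n"
    "2. (U) Sylvania: On 3 April another event. (Short note)"
)

found = paragraph_re.findall(sample)
for paragraph in found:
    print(repr(paragraph))
```

Because the pattern uses \s rather than literal spaces, it needs no changes to work in verbose mode, where unescaped whitespace in the pattern is ignored.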


For example:

import re

# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

print(re.findall(pattern, extracted_text, re.M | re.I))
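Putting the pieces together, one possible end-to-end sketch follows. It assumes pdfminer.six is installed and that the extracted text separates paragraphs with a blank line. The helper names find_paragraphs and paragraphs_from_pdf and the page_count parameter are hypothetical, introduced only for illustration. Joining wrapped lines with single spaces is an optional extra step, not something the question asked for.

```python
import re

PARAGRAPH = re.compile(
    r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*",
    re.MULTILINE | re.IGNORECASE,
)


def find_paragraphs(text):
    """Return every matched entry with wrapped lines joined by single spaces."""
    return [" ".join(m.split()) for m in PARAGRAPH.findall(text)]


def paragraphs_from_pdf(path, page_count):
    """Hypothetical helper: match entries on page 2 onward of the PDF at path.

    Assumes pdfminer.six is installed and that the caller knows page_count.
    """
    from pdfminer.high_level import extract_text  # imported lazily on purpose

    if page_count < 2:
        return []
    # page_numbers is zero-indexed, so page 2 is index 1.
    text = extract_text(path, page_numbers=set(range(1, page_count)))
    return find_paragraphs(text)


# Made-up text standing in for the extracted PDF content.
sample = (
    "1. (U) Freedonia: On 12 March something happened\n"
    "and the report continues here. (Source\n"
    "details in brackets)\n"
    "\n"
    "2. (U) Sylvania: On 3 April another event. (Short note)\n"
)

for entry in find_paragraphs(sample):
    print(entry)
```

The split-and-join step also drops the trailing newline that the pattern can pick up when the last paragraph is followed by a single line break at the end of the text.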