I want to extract some information from a set of PDF files. I only need the entries from page 2 onward, which look like this:
- (U) country: On [date] [text]. (text in brackets)
That is, each entry always starts with a number, a dot, (U) and a country, and ends with text in brackets, which may continue onto the next line.
My implementation in Python is the following:
- use pdfminer extract_text function to get the whole text.
- Then use re.findall function in the whole text using this regex
^\d{1,2}\. \(u\) \w .\w*.\w*:.* on \d{1,2} \w .*$
with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone could help with this. I was hoping to match everything with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.
CodePudding user response:
You could update the pattern to use a negated character class that matches up to the first occurrence of a colon (:), and then match at least the word on surrounded by whitespace after it.
To match all the following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline, so the match stops at the first blank line.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^
  Start of the line (re.M makes ^ match at the start of every line)
\d{1,2}\.\s\(u\)\s
  Match 1-2 digits, a dot, a whitespace char, (u) and another whitespace char
[^:\n]*:
  Match any char except : or a newline, then match :
.*?\son\s
  Match up to the first occurrence of on between whitespace chars
\d{1,2}\s
  Match 1-2 digits and a whitespace char
.*
  Match the rest of the line
(?:
  Non capture group
\n(?![^\S\r\n]*\n).*
  Match a newline, assert that the next line is not only spaces followed by a newline, then match that line
)*
  Close the non capture group and optionally repeat it
For example:
import re
# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
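To see how the blank-line lookahead behaves, here is a small self-contained sketch. The sample text is made up for illustration (the country names and wording are hypothetical, not taken from the real PDFs), and it assumes that pdfminer separates paragraphs with an empty line, which may not hold for every file.

```python
import re

# Pattern from the answer: the header line, then every following line
# up to (but not including) the first blank line.
PATTERN = re.compile(
    r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*",
    re.M | re.I,
)

# Hypothetical sample standing in for the text returned by pdfminer.
sample_text = (
    "Cover page text\n"
    "\n"
    "1. (U) Freedonia: On 3 May talks resumed.\n"
    "(Source: embassy cable,\n"
    "continued here)\n"
    "\n"
    "Unrelated line\n"
    "\n"
    "2. (U) Sylvania: On 12 June the border reopened. (Source: press)\n"
)

# The pattern has no capturing groups, so findall returns whole matches.
# strip() drops a trailing newline that the last entry can pick up
# when the text ends without a blank line.
entries = [match.strip() for match in PATTERN.findall(sample_text)]

for entry in entries:
    print(repr(entry))
```

Note that the match only stops at a blank line. If two entries follow each other without an empty line in between, they would be returned as a single match, so it is worth checking the extracted text for that case.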