Multiline regex in pdf file

Time:11-07

I want to extract some information from PDF files that look like this. I only need the information on page 2 and after, which looks like this:

  1. (U) country: On [date] [text]. (text in brackets)

This means each entry always starts with a number, a dot, and a country, and ends with text in brackets; the bracketed text may continue onto the next line.

My implementation in python is the following:

  1. Use pdfminer's extract_text function to get the whole text.
  2. Then run re.findall on the whole text with this regex ^\d{1,2}\. \(u\) \w .\w*.\w*:.* on \d{1,2} \w .*$ and the re.MULTILINE option.
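Since only page 2 onward matters, one option is to drop the first page before running the regex. This is a minimal sketch, assuming the text returned by pdfminer's extract_text ends each page with a form feed character (\f), which is worth verifying on the actual files. The helper name drop_first_page and the file name in the comment are hypothetical.

```python
def drop_first_page(text: str) -> str:
    """Return everything after the first page.

    Assumption: pages in `text` are separated by form feed characters.
    If there is no form feed at all, the result is an empty string.
    """
    pages = text.split("\f")
    return "\f".join(pages[1:])


# Possible usage with pdfminer.six (file name is a placeholder):
#   from pdfminer.high_level import extract_text
#   body = drop_first_page(extract_text("report.pdf"))
```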

I have noticed that this extracts the first line of every paragraph I am interested in, but I cannot find a way to grab everything up to the end of the paragraph, which is the bracketed text (.*).

I was wondering if anyone can help with this. I was hoping to match this with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
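The line-by-line fallback mentioned here could look roughly like the sketch below. It assumes an entry starts at a line of the form "number. (U) country:" and ends at the next blank line; ENTRY_START and collect_entries are hypothetical names, not part of the original post.

```python
import re

# Assumed start-of-entry test: "<1-2 digits>. (U) <country>:" at the start of a line.
ENTRY_START = re.compile(r"\d{1,2}\.\s\(u\)\s[^:]*:", re.IGNORECASE)


def collect_entries(text):
    """Group lines into entries: start at a matching line, stop at a blank line."""
    entries, current = [], None
    for line in text.splitlines():
        if ENTRY_START.match(line):
            # A new entry begins; close the previous one if it is still open.
            if current is not None:
                entries.append("\n".join(current))
            current = [line]
        elif current is not None:
            if line.strip():
                current.append(line)  # continuation line of the open entry
            else:
                entries.append("\n".join(current))  # blank line ends the entry
                current = None
    if current is not None:
        entries.append("\n".join(current))
    return entries
```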

Thanks in advance.

CodePudding user response:

You could update the pattern to use a negated character class that matches up to the first occurrence of :, and then require at least the word on after it.

To match all the following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.

Using a case insensitive match:

^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*

The pattern matches:

  • ^ Start of a line (with re.M; without it, start of the string)
  • \d{1,2}\.\s\(u\)\s Match 1-2 digits, a literal ., a whitespace char, (u), and another whitespace char
  • [^:\n]*: Match any char except : or a newline, then match :
  • .*?\son\s Match the first occurrence of on between whitespace chars
  • \d{1,2}\s Match 1-2 digits and a whitespace char
  • .* Match the rest of the line
  • (?: Non capture group
    • \n(?![^\S\r\n]*\n).* Match a newline, assert that the next line is not blank (only spaces or tabs followed by a newline), then match that whole line
  • )* Close non capture group and optionally repeat
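
The breakdown above maps directly onto a commented pattern. As a sketch, the same regex can be written with re.VERBOSE so that each piece carries its explanation; PARAGRAPH is only an illustrative name.

```python
import re

PARAGRAPH = re.compile(
    r"""
    ^\d{1,2}\.\s\(u\)\s      # 1-2 digits, a dot, whitespace, "(u)", whitespace
    [^:\n]*:                 # the country: anything up to the first colon on this line
    .*?\son\s                # lazily up to the first "on" between whitespace chars
    \d{1,2}\s                # 1-2 digits and a whitespace char
    .*                       # the rest of the first line
    (?:                      # then, zero or more times:
        \n(?![^\S\r\n]*\n)   #   a newline not followed by a blank line
        .*                   #   and the whole of that next line
    )*
    """,
    re.MULTILINE | re.IGNORECASE | re.VERBOSE,
)
```

In verbose mode, unescaped whitespace and # comments in the pattern are ignored, so the layout does not change what is matched.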


For example:

import re

pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

# extracted_text is the string returned by pdfminer's extract_text
print(re.findall(pattern, extracted_text, re.M | re.I))
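
To check the behaviour without a PDF, the pattern can be tried on a small made-up string. The sample text below is invented for illustration and only imitates the layout described in the question. Each match is stripped because, when an entry is followed by a single newline at the very end of the text, the match includes that trailing newline.

```python
import re

pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

# Invented sample: two entries separated by blank lines, one wrapping onto a second line.
sample_text = (
    "Page header\n"
    "\n"
    "1. (U) France: On 3 March something happened (source one\n"
    "continues here)\n"
    "\n"
    "Unrelated line\n"
    "\n"
    "2. (U) Germany: On 12 April another event (source two)\n"
)

# Strip each match to drop a possible trailing newline on the last entry.
matches = [m.strip() for m in re.findall(pattern, sample_text, re.M | re.I)]

for entry in matches:
    print(repr(entry))
```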