I am trying to extract chapters/sections from txt files that were generated by running pdftotext on Portuguese lawsuit documents. Initially I tried this regex to at least get each chapter title:
^[A-Z\s\d\W]+$
It appeared to work for this example:
So, how can I get not only each chapter/section title but also its content?
I tried a regex to get each chapter together with its content, but it did not work very well on some documents.
CodePudding user response:
An approach using 2 capture groups:
^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$
^ : Start of string (start of each line when the multiline flag is set)
[^\S\n]* : Match optional spaces without newlines
( : Capture group 1
[A-Z][^a-z]* : Match a single uppercase char followed by zero or more chars other than a lowercase a-z
) : Close group
( : Capture group 2
(?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)* : Optionally repeat matching all lines that do not start with a title-like pattern
) : Close group
$ : End of string (end of each line when the multiline flag is set)
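As a rough illustration of how the two capture groups could be used, here is a minimal Python sketch. It assumes the multiline flag is enabled so that ^ and $ match at line boundaries. The helper name split_sections and the sample text are made up for this example and are not taken from the real documents.

```python
import re

# The two-group pattern from above; re.MULTILINE lets ^ and $ match at
# the start and end of every line instead of only the whole string.
TITLE_AND_BODY = re.compile(
    r"^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$",
    re.MULTILINE,
)


def split_sections(text):
    """Return (title, content) pairs; the helper name is hypothetical."""
    return [
        (m.group(1).strip(), m.group(2).strip())
        for m in TITLE_AND_BODY.finditer(text)
    ]


# Invented sample: title lines contain no lowercase a-z letters.
sample = "\n".join([
    "DOS FACTOS",
    "O autor celebrou um contrato com a empresa.",
    "O contrato foi assinado em 2019.",
    "DO DIREITO",
    "Aplica-se o artigo 405 do mesmo diploma.",
    "DISPOSITIVO",
    "Julgo o pedido procedente.",
])

sections = split_sections(sample)
for title, content in sections:
    print(title, "->", repr(content))
```

Note that [^a-z] only excludes unaccented lowercase letters, so a line made of uppercase letters plus accented lowercase ones would still be treated as a title. Whether that matters depends on the documents.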
A slightly more PCRE-like approach, using \h for horizontal whitespace, \R for any newline sequence, and an atomic group (?>...):
^\h*([A-Z][^a-z]*)((?>\R(?!\h*[A-Z][^a-z\r\n]*$).*)*)$
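Python's built-in re module rejects \h and \R, so the PCRE variant cannot be pasted into it as-is. One possible workaround, sketched here under the assumption that the files may contain Windows or old Mac line endings, is to normalize every newline to \n first and then reuse the first pattern. The function names and the sample are hypothetical.

```python
import re

# Same two-group pattern as the first approach, compiled in multiline mode.
PATTERN = re.compile(
    r"^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$",
    re.MULTILINE,
)


def normalize_newlines(text):
    """Convert \\r\\n and lone \\r to \\n so the pattern only has to handle \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_sections_any_newline(text):
    """Hypothetical helper: normalize line endings, then extract sections."""
    normalized = normalize_newlines(text)
    return [
        (m.group(1).strip(), m.group(2).strip())
        for m in PATTERN.finditer(normalized)
    ]


# Invented sample using Windows-style line endings.
crlf_sample = (
    "DOS FACTOS\r\n"
    "O autor celebrou um contrato.\r\n"
    "DISPOSITIVO\r\n"
    "Julgo o pedido procedente.\r\n"
)

crlf_sections = split_sections_any_newline(crlf_sample)
print(crlf_sections)
```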