Regex to get text between chapters titles in upper-case

Time:12-20

I am trying to extract chapters/sections from txt files that were generated by running pdftotext on Portuguese lawsuit documents. Initially I tried this regex to at least get each chapter title:

^[A-Z\s\d\W]+$

It appeared to work for this example: [two screenshots, not preserved]

So, how can I get not only each chapter/section title but also its content?

I tried a regex to get each chapter together with its content, but it did not work very well on some documents.

CodePudding user response:

An approach using 2 capture groups:

^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$
  • ^ Start of the line (the pattern needs the multiline flag)
  • [^\S\n]* Match optional whitespace chars other than newlines
  • ( Capture group 1
    • [A-Z][^a-z]* Match a single uppercase char A-Z followed by optional chars other than lowercase a-z
  • ) Close group
  • ( Capture group 2
    • (?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)* Optionally repeat matching a newline and the rest of the line, for every following line that does not look like a title
  • ) Close group
  • $ End of the line
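
To show how the two capture groups could be used in practice, here is a minimal Python sketch. It assumes the text has already been read into a string and uses `\n` line endings. The helper name `split_chapters` and the sample text are made up for illustration and are not part of the original answer.

```python
import re

# The pattern from the answer. re.MULTILINE makes ^ and $ anchor at
# line boundaries instead of only at the start/end of the whole string.
CHAPTER_RE = re.compile(
    r"^[^\S\n]*([A-Z][^a-z]*)((?:\n(?![^\S\n]*[A-Z][^a-z\n]*$).*)*)$",
    re.MULTILINE,
)


def split_chapters(text):
    """Return a list of (title, body) tuples (hypothetical helper).

    Both groups are stripped, because group 2 starts with the newline
    that follows the title line.
    """
    return [
        (m.group(1).strip(), m.group(2).strip())
        for m in CHAPTER_RE.finditer(text)
    ]


# Made-up sample standing in for pdftotext output (ASCII only on purpose).
sample = (
    "I. RELATORIO\n"
    "O autor apresentou a peticao inicial.\n"
    "Foram juntos documentos.\n"
    "II. FUNDAMENTACAO\n"
    "Cumpre apreciar e decidir.\n"
)

for title, body in split_chapters(sample):
    print(repr(title), "->", repr(body))
```

Note that `[A-Z]` only covers unaccented ASCII letters, so a title that begins with an accented capital such as `Í` would not be picked up as a title by this pattern as written.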


A more PCRE-like approach, using \h for horizontal whitespace, \R for any newline sequence, and an atomic group:

^\h*([A-Z][^a-z]*)((?>\R(?!\h*[A-Z][^a-z\r\n]*$).*)*)$

