After extracting text from PDF files using pdftotext, I am trying to recover their section titles and the respective contents.
In this batch of files, each section follows the same pattern: a new line, then a Roman numeral, optionally followed by a dot or hyphen, then the title, then a line break.
So I tried this pattern:
^[^\S\n]*([CLXVI]{1,7})\.\s?(.*?)\n([\S\s]*)(?=[CLXVI]{1,7})
But it did not work as expected:
https://regex101.com/r/vX4aB4/1
The expected result was something like:
group title -> Breve Síntese da Demanda
group content -> Lorem ipsum dolor ... faucibus.
group title -> Bla Bla bla
group content -> Lorem ipsum dolor ... faucibus.
group title -> Do Mérito
group content -> Lorem ipsum dolor ... commodo.
group title -> Conclusão
group content -> Lorem ipsum dolor ... .
So how can I improve the pattern to properly recover each title and its respective content?
CodePudding user response:
You can use a negative lookahead to keep the content group from running past the next heading, e.g.
^(\h*+[CLXVI]{1,7}\.)\h*(.+)\s*((?:(?!(?1)).*\R?)*)
See your updated demo at regex101. Use it in (?m) multiline mode.
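The relevant part, (?!(?1)), is checked at the start of each content line and stops the content group from consuming a line that matches the first group's pattern, i.e. the next heading.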
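This is a PCRE regex: it relies on a subroutine call to the first group, (?1), and on a possessive quantifier, so it will not work unchanged in engines that lack those features.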
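
If PCRE is not an option, the same idea can be sketched in Python. This is only a rough port under a few assumptions: Python's re module has no \h, \R or (?1), so the heading pattern is kept in a string constant and repeated inside the lookahead, and [^\S\n] stands in for \h. The names HEADING, SECTION and split_sections are made up for this illustration, and the dot after the Roman numeral is required, as in the pattern above.

```python
import re

# Heading prefix: optional horizontal whitespace, a Roman numeral, then a dot.
# [^\S\n] replaces PCRE's \h, which Python's re module does not support.
HEADING = r"[^\S\n]*[CLXVI]{1,7}\."

SECTION = re.compile(
    # Heading line: prefix, optional spaces, then the title up to the line break.
    rf"^{HEADING}[^\S\n]*(?P<title>.+)\n"
    # Content: take whole lines while the next line does not start a heading.
    rf"(?P<content>(?:(?!{HEADING}).*\n?)*)",
    re.MULTILINE,
)


def split_sections(text):
    """Return a list of (title, content) pairs, one per Roman-numeral heading."""
    return [
        (m.group("title").strip(), m.group("content").strip())
        for m in SECTION.finditer(text)
    ]
```

Calling split_sections on the pdftotext output should give one (title, content) pair per heading. A content line that itself starts with a Roman numeral and a dot would be treated as a new heading, so the result is worth checking against a few real files.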