How to extract paragraphs from text in Python using regex

Time:04-13

I'm new to regex and trying to extract paragraphs from PDF-extracted text. The criteria I'd like to set is: any segment beginning with a capital letter and ending with a full-stop/question mark/exclamation point followed by a newline character. As an example, if I have the string

"but Okay. So.\n What do we\n do now?\n I have no clue.\n"

I would like to get back ["Okay. So.\n", "What do we\n do now?\n", "I have no clue.\n"] as three separate chunks.

So far, this is what I've tried:

re.findall("[A-Z].*[\.\?\!]\n",text,re.DOTALL)

but it's not working, i.e. it's not returning the separate chunks in the text. For the example provided, I'm just getting back everything from the first capital letter to the end of the string as a single item. Any help would be greatly appreciated!

Edit: I should've clarified, and I'm realising my examples didn't cover everything I'm trying to avoid: my extracted text has a bunch of newline characters without full stops preceding them, and I'd like to keep those intact. There are also stray bits starting with lowercase letters that aren't the starts of actual paragraphs, so I'd like to exclude those as well and begin each match at the uppercase letter.

CodePudding user response:

Here you go:

import re
text = "Okay. So.\n What do we do now?\n I have no clue.\n"
print(re.findall(".*\n",text))

... gives:

['Okay. So.\n', ' What do we do now?\n', ' I have no clue.\n']

UPDATED: The behavior described in the OP's clarified question can be achieved like this:

import re
text = "but Okay. So.\n What do we\n do now?\n I have no clue.\n"
print(re.findall('[A-Z].*?[?!.]\n', text, re.S))

Output:

['Okay. So.\n', 'What do we\n do now?\n', 'I have no clue.\n']

Explanation:

  • [A-Z] matches the leading uppercase letter
  • .*? matches 0 or more characters (including \n since flags=re.S) in a non-greedy way thanks to the ? as detailed in the docs as follows:

*?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.

  • [?!.]\n matches one occurrence of '?', '!' or '.' followed by '\n'. Thanks to the non-greedy qualifier ? in the preceding bullet, the match stops at the first such terminator instead of skipping ahead to the last one, so the string is divided into "paragraphs" as described in your question.
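To make the greedy vs. non-greedy difference concrete, here is a small side-by-side sketch using the example string from the question. The variable names (greedy, lazy) are just illustrative, and the greedy pattern is a raw-string equivalent of the attempt in the question rather than a verbatim copy:

```python
import re

text = "but Okay. So.\n What do we\n do now?\n I have no clue.\n"

# Greedy: .* first runs to the end of the text (re.S lets . match \n),
# then backtracks only as far as the LAST terminator followed by \n,
# so everything from the first capital letter onward is a single match.
greedy = re.findall(r"[A-Z].*[.?!]\n", text, re.S)

# Non-greedy: .*? expands one character at a time and stops at the FIRST
# terminator followed by \n, so each "paragraph" becomes its own match.
lazy = re.findall(r"[A-Z].*?[.?!]\n", text, re.S)

print(greedy)
print(lazy)
```

Note that both patterns require the '.', '?' or '!' to sit directly before the newline. If your PDF extraction leaves trailing spaces before line breaks, the pattern would need adjusting, for example by stripping that whitespace first.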