I'm trying to use a regex statement to extract a specific block of text between two known phrases that will be repeated in other documents, and remove everything else. These few sentences will then be passed into other functions.
My problem seems to be that the regex works when the phrases I'm searching for are on the same line. If they're on different lines I get:
print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'
I'm expecting future reports to have line breaks at different points depending on what was written before - is there a way to prepare the text first by removing all line breaks, or to make my regex statement ignore those when searching?
Any help would be great, thanks!
import fitz
import re
doc = fitz.open(r'file.pdf')
text_list = [ ]
for page in doc:
    text_list.append(page.getText())
    #print(text_list[-1])
text_string = ' '.join(text_list)
test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition" #works for this test
pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"
match = re.search(pat, text_string)
print(match.group(1).strip())
When the two phrases in pat are on the same line of the long text, the search works. As soon as they fall on different lines, it no longer matches.
Here is a sample of the input text giving me an issue:
Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands
CodePudding user response:
Note that . matches any character other than a newline, so you could use (.|\n) to capture everything. Also, it seems that a line break can fall inside your fixed phrases, so the pattern should allow any whitespace between the words rather than a literal space. First define the prefix and suffix of the pattern:
prefix = r"Observations\s+of\s+Client\s+Behavior:"
suffix = r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"
and then build the pattern and find all occurrences:
pattern = prefix + r"((?:.|\n)*?)" + suffix
f = re.findall(pattern, text_string)
By using *? at the end of r"((?:.|\n)*?)", we match as few characters as possible, so each match stops at the first suffix that follows its prefix.
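As a side note, a minimal sketch of the same idea using the re.DOTALL flag, which makes . match newlines as well and so removes the need for (?:.|\n). The sample text here is made up for illustration; it also contrasts the lazy (.*?) and greedy (.*) forms:

```python
import re

# Made-up sample: two blocks, with line breaks inside the fixed phrases.
text = (
    "Observations of Client Behavior: first block Observations of Client's\n"
    "response to skill acquisition: filler\n"
    "Observations of Client\nBehavior: second block Observations of Client's "
    "response to skill acquisition: tail"
)

prefix = r"Observations\s+of\s+Client\s+Behavior:"
suffix = r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"

# With re.DOTALL, "." also matches "\n".
lazy = re.findall(prefix + r"(.*?)" + suffix, text, flags=re.DOTALL)
greedy = re.findall(prefix + r"(.*)" + suffix, text, flags=re.DOTALL)

print(lazy)         # one entry per block
print(len(greedy))  # the greedy form swallows both blocks into one match
```

The greedy form runs from the first prefix to the last suffix, which is why the lazy quantifier matters when the phrases repeat.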
Example of multi-line multi-pattern:
text_string = '''any thing Observations of Client Behavior: patern1 Observations of Client's
response to skill acquisition: any thing
any thing Observations of Client Behavior: patern2 Observations of
Client's response to skill acquisition: any thing Observations of Client
Behavior: patern3 Observations of Client's response to skill acquisition: any thing any thing'''
result = re.findall(pattern, text_string)
# result == [' patern1 ', ' patern2 ', ' patern3 ']
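The question also asks whether the text can be prepared by removing line breaks first. A rough sketch of that alternative, assuming it is acceptable to collapse every run of whitespace into a single space before searching; extract_observations is a hypothetical helper name, not part of any library. It also guards against re.search returning None, which is what caused the AttributeError:

```python
import re

def extract_observations(raw_text):
    """Return the text between the two headings, or None if they are absent."""
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    flat = " ".join(raw_text.split())
    match = re.search(
        r"Observations of Client Behavior:(.*?)"
        r"Observations of Client's response to skill acquisition:",
        flat,
    )
    # re.search returns None when nothing matches, so check before .group().
    return match.group(1).strip() if match else None

sample = (
    "Observations of Client Behavior: Elopement frequency is on an\n"
    "overall decreasing trend. Observations of Client's response to skill\n"
    "acquisition: Overall skill acquisition data trends"
)
print(extract_observations(sample))
print(extract_observations("no headings here"))
```

Because the text is flattened first, the pattern can use plain spaces and needs neither \s+ nor a newline-aware dot.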