I am trying to detect specific words (with a regex pattern that I already have) in a Word document. I want not only to detect each word but also to know on which page it appears, something like a list of tuples: [(WordA, 10), (WordB, 4), ...]
I am able to extract the text from the Word document and detect all the words that match the regex pattern, but I cannot tell on which page each word appears. Also, I want to detect all occurrences regardless of whether they appear in the header, body or footnotes.
Here is my regex pattern:
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
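For reference, a quick check of what this pattern accepts. The sample string is made up for illustration: the pattern takes a hyphen, en dash or em dash after DOC, requires exactly nine digits, and rejects a longer digit run or a prefix glued to DOC.

```python
import re

pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')

# Hypothetical sample text, not taken from a real document.
sample = "See DOC-123456789, DOC–987654321, DOC-1234567890 and XDOC-111111111."

# Only the first two IDs match: the third has ten digits (blocked by the
# negative lookahead) and the fourth has no word boundary before "DOC".
print(pattern.findall(sample))
# → ['DOC-123456789', 'DOC–987654321']
```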
Extraction of text:
import docx2txt
result = docx2txt.process("Word_Document.docx")
Thank you in advance,
CodePudding user response:
I just wanted to say thank you to those who tried to answer this question. I found two solutions:
With Word Documents, splitting them into one word document per page with Aspose: https://products.aspose.cloud/words/python/split/
Convert the Word document into a PDF and then create one PDF per page with PyPDF2 or another library.
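A minimal sketch of the PDF route, assuming the document has already been converted to PDF by some other tool. Instead of writing one PDF file per page, this variation reads the text page by page, which gives the same page numbers. The function names matches_by_page and pdf_page_texts are hypothetical, and a recent PyPDF2 (3.0 or later) is assumed.

```python
import re

pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')


def matches_by_page(page_texts):
    """Pair every match with the 1-based number of the page it was found on."""
    return [
        (match, page_num)
        for page_num, text in enumerate(page_texts, start=1)
        for match in pattern.findall(text)
    ]


def pdf_page_texts(pdf_path):
    """Return the extracted text of each page of an existing PDF file."""
    # Third-party dependency, imported lazily; PyPDF2 >= 3.0 is an assumption.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty result for pages without a text layer.
    return [page.extract_text() or "" for page in reader.pages]
```

With a converted file at hand, matches_by_page(pdf_page_texts("Word_Document.pdf")) would give the list of (word, page) tuples; the file name here is only a placeholder.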
CodePudding user response:
Ok, after a while of trying to figure this out, I managed to get this:
import docx2txt.docx2txt as docx2txt
import re

# Filled in by xml2text below: one string per rendered page.
page_contents = []

def xml2text(xml):
    text = u''
    root = docx2txt.ET.fromstring(xml)
    start = 0  # offset in text where the current page begins
    for child in root.iter():
        if child.tag == docx2txt.qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == docx2txt.qn('w:tab'):
            text += '\t'
        elif child.tag in (docx2txt.qn('w:br'), docx2txt.qn('w:cr')):
            text += '\n'
        elif child.tag == docx2txt.qn("w:p"):
            text += '\n\n'
        elif child.tag == docx2txt.qn('w:lastRenderedPageBreak'):
            # A page ends here: store everything collected since the last break.
            end = len(text) + 1
            page_contents.append(text[start:end])
            start = len(text)
    # Store whatever follows the final page break.
    page_contents.append(text[start:len(text) + 1])
    return text

# Replace the module's parser with the page-aware version, then run it.
docx2txt.xml2text = xml2text
docx2txt.process('test_file.docx')  # use your filename
matches = []
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
for page_num, page_content in enumerate(page_contents, start=1):
    # do regex search
    all_matches = pattern.findall(page_content)
    if all_matches:
        for match in all_matches:
            matches.append((match, page_num))
print(matches)
It replaces the module's xml2text parser with a version that also detects page breaks. Each time it meets a 'lastRenderedPageBreak' tag, it appends the text collected since the previous break to the module-level list page_contents, so the page number is the list index plus 1. One slight caution: if you have edited the file, save it again so that the placement of these tags also gets updated.
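The same idea can also be sketched without patching docx2txt, using only the standard library. This is an illustration under assumptions, not a drop-in replacement: the names split_pages and find_with_pages are hypothetical, it reads only word/document.xml (so headers, footers and footnotes are not searched and do not add entries to the page list), and it relies on the same 'lastRenderedPageBreak' tags being present and up to date.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the w: prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PATTERN = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')


def split_pages(document_xml):
    """Split the body text into one string per rendered page."""
    pages, current = [], []
    for node in ET.fromstring(document_xml).iter():
        if node.tag == f"{{{W_NS}}}lastRenderedPageBreak":
            # Close the current page and start collecting the next one.
            pages.append("".join(current))
            current = []
        elif node.tag == f"{{{W_NS}}}t":
            current.append(node.text or "")
        elif node.tag == f"{{{W_NS}}}p":
            # Paragraph start: add a newline so words do not run together.
            current.append("\n")
    pages.append("".join(current))  # text after the final break
    return pages


def find_with_pages(docx_path):
    """Return (match, page_number) tuples for the main document body."""
    with zipfile.ZipFile(docx_path) as zf:
        pages = split_pages(zf.read("word/document.xml"))
    return [
        (match, page_num)
        for page_num, page in enumerate(pages, start=1)
        for match in PATTERN.findall(page)
    ]
```

Calling find_with_pages('test_file.docx') returns the same kind of list as the matches variable above. A match that straddles a page break would be missed by both versions, since each page is searched on its own.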