Detect word and page of occurrence in Word Document

I am trying to detect specific words (with a regex pattern that I already have) in a Word document. I want not only to detect each word but also to know on which page it appears. I am thinking of something like a list of tuples: [(WordA, 10), (WordB, 4), ...]

I am able to extract the text from the Word document and detect all the words that match the regex pattern, but I am not able to tell on which page each word appears. Also, I want to detect all occurrences regardless of whether they appear in the header, body or footnotes.

Here is my regex pattern:

pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')

Extraction of text:

import docx2txt
 
result = docx2txt.process("Word_Document.docx")

Thank you in advance,

CodePudding user response：

I just wanted to say thank you to those who tried to answer this question. I found two solutions:

For Word documents, split the file into one Word document per page with Aspose: https://products.aspose.cloud/words/python/split/
Convert the Word document to PDF and then create one PDF per page with PyPDF2 or another library.
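
The PDF route can be sketched roughly as follows. This is only an illustration under some assumptions: the document has already been exported to PDF by other means, PyPDF2 3.x is installed, and the helper names matches_by_page and pdf_page_texts are made up for this example. Reading the text page by page may be enough, so splitting into one file per page is skipped here.

```python
import re

PATTERN = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')


def matches_by_page(page_texts):
    """Pair every regex match with the 1-based number of the page it is on."""
    return [
        (match, page_num)
        for page_num, text in enumerate(page_texts, start=1)
        for match in PATTERN.findall(text)
    ]


def pdf_page_texts(pdf_path):
    """Return the extracted text of each PDF page (assumes PyPDF2 3.x)."""
    # Imported lazily so that matches_by_page has no third-party dependency.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Usage ("Word_Document.pdf" is a placeholder for the exported file):
# print(matches_by_page(pdf_page_texts("Word_Document.pdf")))
```

How much of the header and footnote text ends up in each page's text depends on the PDF export and on the extraction library, so the result is worth checking against a document you know.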

CodePudding user response：

Ok, after a while of trying to figure this out, I managed to get this:

import docx2txt.docx2txt as docx2txt
import re

page_contents = []


def xml2text(xml):
    """Copy of docx2txt's xml2text that also records the text of each page."""
    text = ''
    root = docx2txt.ET.fromstring(xml)
    start = 0  # offset in text where the current page begins
    for child in root.iter():
        if child.tag == docx2txt.qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == docx2txt.qn('w:tab'):
            text += '\t'
        elif child.tag in (docx2txt.qn('w:br'), docx2txt.qn('w:cr')):
            text += '\n'
        elif child.tag == docx2txt.qn("w:p"):
            text += '\n\n'
        elif child.tag == docx2txt.qn('w:lastRenderedPageBreak'):
            # A page just ended: store its text and start the next page here
            end = len(text) + 1
            page_contents.append(text[start:end])
            start = len(text)
    # Whatever follows the last page break is the final page
    page_contents.append(text[start:len(text) + 1])
    return text


docx2txt.xml2text = xml2text
docx2txt.process('test_file.docx')  # use your filename

matches = []
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
for page_num, page_content in enumerate(page_contents, start=1):
    # do regex search
    all_matches = pattern.findall(page_content)
    if all_matches:
        for match in all_matches:
            matches.append((match, page_num))

print(matches)

This replaces the module's xml2text parser with a version that also detects page breaks: each time it sees one, it appends the text of the page that just ended to the module-level list page_contents, so the list index plus 1 is the page number. It relies on the 'lastRenderedPageBreak' tag. One caution: if you have edited the file, save it first so that the placement of these tags is updated too.
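
A variant of the same idea that avoids patching docx2txt is to read word/document.xml straight from the .docx archive with the standard library. The sketch below is a minimal illustration, not a drop-in replacement: split_pages and find_in_body are hypothetical names, and it covers the document body only. Headers, footers and footnotes live in separate parts of the archive (for example word/header1.xml and word/footnotes.xml). As far as I can tell, docx2txt.process also passes header and footer parts through xml2text, so with the patched version each of those parts would add its own entry to page_contents and shift the page numbers; that is worth checking on a file that has headers.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the tags in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PATTERN = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')


def split_pages(xml_bytes):
    """Return the text of each page, split at w:lastRenderedPageBreak tags."""
    pages, current = [], []
    for node in ET.fromstring(xml_bytes).iter():
        if node.tag == W_NS + "t":
            current.append(node.text or "")
        elif node.tag == W_NS + "p":
            current.append("\n")  # keep paragraphs apart
        elif node.tag == W_NS + "lastRenderedPageBreak":
            pages.append("".join(current))
            current = []
    pages.append("".join(current))  # text after the last break
    return pages


def find_in_body(docx_path):
    """Return (match, page_number) tuples for the document body only."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    return [
        (match, page_num)
        for page_num, text in enumerate(split_pages(xml_bytes), start=1)
        for match in PATTERN.findall(text)
    ]


# Usage ("test_file.docx" is a placeholder):
# print(find_in_body("test_file.docx"))
```

The same caution applies here: the page numbers are only as accurate as the lastRenderedPageBreak tags stored in the file when it was last saved.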