Detect specific pattern regex python-CodePudding

I want to find all the occurrences of an specific term (and its variations) in a word document.

Extracted the text from the word document
Try to find pattern via regex

The pattern consists of words that start with DOC- and after the - there are 9 digits.

I have tried the following without success:

document variable is the extracted text with the following function:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

pattern = re.compile('^DOC.\d{9}$')
pattern.findall(document)

pattern.findall(document)

Can someone help me?

Thanks in advance

CodePudding user response：

You can use a combinbation of word and numeric right-hand boundaries.

Also, you say there must be a dash after DOC, but you use a . in the pattern. I believe you wanted to also match any en- or em-dash, so I'd suggest to use a more precise pattern, like [-–—]. Note there are other ways to match any Unicode dash char, see Searching for all Unicode variation of hyphens in Python.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print( re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)) )

Details:

\b - a word boundary
DOC - DOC substring
[-–—] - a dash symbol (hyphen, en- or em-dash)
\d{9} - nine digits
(?!\d) - immediately to the right of the current location, there must be no digit.