Home > other >  Detect specific pattern regex python
Detect specific pattern regex python

Time:11-02

I want to find all the occurrences of an specific term (and its variations) in a word document.

  1. Extracted the text from the word document
  2. Try to find pattern via regex

The pattern consists of words that start with DOC- and after the - there are 9 digits.

I have tried the following without success:

document variable is the extracted text with the following function:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
  1. pattern = re.compile('^DOC.\d{9}$')
  2. pattern.findall(document)

pattern.findall(document)

Can someone help me?

Thanks in advance

CodePudding user response:

You can use a combinbation of word and numeric right-hand boundaries.

Also, you say there must be a dash after DOC, but you use a . in the pattern. I believe you wanted to also match any en- or em-dash, so I'd suggest to use a more precise pattern, like [-–—]. Note there are other ways to match any Unicode dash char, see Searching for all Unicode variation of hyphens in Python.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print( re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)) )

Details:

  • \b - a word boundary
  • DOC - DOC substring
  • [-–—] - a dash symbol (hyphen, en- or em-dash)
  • \d{9} - nine digits
  • (?!\d) - immediately to the right of the current location, there must be no digit.
  • Related