I want to find all the occurrences of an specific term (and its variations) in a word document.
- Extracted the text from the word document
- Try to find pattern via regex
The pattern consists of words that start with DOC- and after the - there are 9 digits.
I have tried the following without success:
document variable is the extracted text with the following function:
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
- pattern = re.compile('^DOC.\d{9}$')
- pattern.findall(document)
pattern.findall(document)
Can someone help me?
Thanks in advance
CodePudding user response:
You can use a combinbation of word and numeric right-hand boundaries.
Also, you say there must be a dash after DOC
, but you use a .
in the pattern. I believe you wanted to also match any en- or em-dash, so I'd suggest to use a more precise pattern, like [-–—]
. Note there are other ways to match any Unicode dash char, see Searching for all Unicode variation of hyphens in Python.
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
print( re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)) )
Details:
\b
- a word boundaryDOC
-DOC
substring[-–—]
- a dash symbol (hyphen, en- or em-dash)\d{9}
- nine digits(?!\d)
- immediately to the right of the current location, there must be no digit.