Confusing: python regex does not capture a working regex pattern

Time:02-01

I am using a regex to capture a string from a Word file (and from many similar Word files). Oddly, a pattern that looks correct and works on regex101.com does not match in Python.

In case the problem has something to do with the Word file itself, I am attaching a drive link here for your reference.

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')

text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")

nob = text.split('BID OPENING DATE')
del nob[0]

txt = nob[0]

engineers_estimate = re.search('ENGINEERS EST\s (?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)
if not (engineers_estimate is None):
    engineers_estimate = engineers_estimate.group(1)
else:
    engineers_estimate = 'Not captured'

amount_under_over = re.search('(AMOUNT (?:OVER|UNDER))\s ((?:\d{1,3}(?:\,\d{3})*(?:\.\d\d)?))\b', txt)
if not (amount_under_over is None):
    amount_under_over1 = amount_under_over.group(2)
else:    
    amount_under_over1 = 'Not captured'

The code successfully captures the engineers_estimate variable but cannot capture anything for amount_under_over.

print(amount_under_over) prints None.

According to this regex101 template, the pattern should capture the relevant AMOUNT OVER/UNDER value. Thank you so much!

Edit: Removing \b from the pattern worked! I'm not sure why it worked though.

CodePudding user response:

I think the problem is the backslash escapes that Python processes in ordinary string literals. You can put r before the opening quote to mark the pattern as a raw string, so every backslash reaches the regex engine unchanged. For example:

engineers_estimate = re.search(r'ENGINEERS EST\s (?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)

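Removing \b fixed your problem because, in a non-raw string, \b is the escape sequence for the backspace character (0x08), so the pattern required a literal backspace right after the amount. Sequences such as \s and \d are not Python string escapes, so they reach the regex engine intact, which is why the engineers_estimate pattern worked. With a raw string you can keep \b and it will act as a regex word boundary, as it does on regex101.com.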
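To see the difference for yourself, here is a minimal sketch. The sample string is made up for illustration (the real text comes from the .doc file), and the non-raw version is simulated by swapping the trailing \b for an actual backspace character, which is what Python does with '\b' in an ordinary literal.

```python
import re

# Hypothetical sample text; the real input comes from the Word file.
sample = "AMOUNT UNDER  1,234.56 PERCENT"

# In an ordinary literal, '\b' is a single backspace character (0x08).
assert len('\b') == 1 and ord('\b') == 8
# In a raw literal, r'\b' is two characters: a backslash and the letter b.
assert len(r'\b') == 2

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw_pattern = r'(AMOUNT (?:OVER|UNDER))\s ((?:\d{1,3}(?:\,\d{3})*(?:\.\d\d)?))\b'

# Simulate the non-raw literal from the question: the trailing \b
# becomes a literal backspace character before re ever sees it.
non_raw_pattern = raw_pattern.replace(r'\b', '\b')

raw_match = re.search(raw_pattern, sample)
non_raw_match = re.search(non_raw_pattern, sample)

print(raw_match.group(2))  # 1,234.56
print(non_raw_match)       # None, because the text contains no backspace
```

Using raw strings for every regex pattern avoids this whole class of surprises, since you no longer need to remember which backslash sequences Python itself interprets.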
