I'm using AWS Textract to pull information from PDF documents. After the scanned text is returned from AWS and persisted to a var, I'm doing this:
phone_number = '(555) 123-4567'
scanned_pdf_text.should have_text phone_number
But this fails about 20% of the time because AWS returns the scanned PDF text non-deterministically. The phone number can appear in either of these two ways:
(555)123-4567
or (555) 123-4567
Some of this scanned text is very large, and I'd prefer not to go through the exercise of sanitizing the text coming back if I can avoid it (I'm also not good at regex usage). I also think using "or" logic to handle both cases seems a little heavy-handed just to check text that is so similar (and clearly near-identical to the human eye).
Is there an RSpec matcher that'll allow me to check on this text? I'm also using Capybara.default_normalize_ws = true but that doesn't seem to help in this case.
CodePudding user response:
Assuming scanned_pdf_text is a string and the only differences you're seeing are in whitespace, you can strip all whitespace and compare:
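scanned_pdf_text.gsub(/\s+/, '').should eq('(555)123-4567') # exact: the whole string must equal the number
scanned_pdf_text.gsub(/\s+/, '').should include('(555)123-4567') # partial: substring check (match would treat the parentheses as a regex group)
scanned_pdf_text.gsub(/\s+/, '').should have_text('(555)123-4567') # partial: Capybara matcher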
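If you'd rather not rewrite the large scanned string at all, another option is to make the expected value tolerant of whitespace instead. The following is a minimal sketch, not something built into RSpec or Capybara: whitespace_tolerant_pattern is a hypothetical helper name, and it assumes whitespace is the only thing that varies between scans.

```ruby
# Hypothetical helper (not part of RSpec or Capybara): builds a Regexp that
# matches `expected` regardless of where whitespace appears inside it.
def whitespace_tolerant_pattern(expected)
  # Drop whitespace from the expected value, escape every remaining character
  # so that "(", ")" and "-" are matched literally, then allow optional
  # whitespace between each pair of characters.
  chars = expected.gsub(/\s+/, '').chars.map { |c| Regexp.escape(c) }
  Regexp.new(chars.join('\s*'))
end

pattern = whitespace_tolerant_pattern('(555) 123-4567')

# Both spacings reported in the question, embedded in some surrounding text.
samples = ['Call (555)123-4567 today', 'Call (555) 123-4567 today']
results = samples.map { |text| text.match?(pattern) }
```

In a spec this could be used as scanned_pdf_text.should match(pattern), since RSpec's match matcher accepts a Regexp. The trade-off is that the pattern also accepts whitespace in places a human would not expect (for example between digits), which is probably acceptable for OCR output but worth keeping in mind.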