I have extracted text from an image using tesseract, and now have the entire text. However, I only want to extract the company number, e.g. “123456”, which is a random 6-digit number. I want to use it to name the saved file after the company number, so that files can be identified more easily.
My Question: If I have a text containing bytes and unicode, what is the easiest way to extract this 6-digit number?
CodePudding user response:
You can use regex:
import re
string = 'some random text with a 6-digit number 123456 somewhere'
res = re.findall(r'\b\d{6}\b', string)
print(res)
Output:
['123456']
Explanation:
\d{6}: match exactly 6 digits
\b: ensure partial numbers are not matched, e.g. don't get "123456" from "1234567"
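Since the question mentions text containing both bytes and unicode, here is a minimal sketch of the same regex wrapped in a helper that decodes bytes first. The function name extract_company_number, the UTF-8 encoding, and the sample input are assumptions for illustration, not something stated in the question.

```python
import re

# In str patterns, \d also matches non-ASCII digits; re.ASCII limits it to 0-9.
SIX_DIGITS = re.compile(r'\b\d{6}\b', re.ASCII)

def extract_company_number(text):
    """Return the first standalone 6-digit number in text, or None (hypothetical helper)."""
    if isinstance(text, bytes):
        # Assumption: the bytes are UTF-8; undecodable bytes are replaced, not raised.
        text = text.decode('utf-8', errors='replace')
    match = SIX_DIGITS.search(text)
    return match.group(0) if match else None

number = extract_company_number(b'Invoice \xc3\x84 for company 123456, ref 1234567')
print(number)
```

If a number is found, it could then be used to build the file name, for example f'{number}.txt'. The helper returns None when nothing matches, so that case should be checked before saving.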
CodePudding user response:
An example of the text you are trying to convert would help.
However, you could easily extract the subset of characters that match a given criterion:
s = 'abc12ÄÄ34$$' # sample string
digits = [ch for ch in s if ch.isnumeric()] # returns a list with only the numeric characters
digits = ''.join(digits) # if you want a string
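One caveat with this filter, sketched below: it keeps every numeric character in the string, so separate numbers are merged into one. It therefore probably only suits text whose only digits are the company number. The helper name digits_only and the sample strings are made up for illustration.

```python
def digits_only(s):
    """Keep only the characters for which str.isnumeric() is true (hypothetical helper)."""
    return ''.join(ch for ch in s if ch.isnumeric())

# The sample string from the answer: the digits are joined across the gaps.
print(digits_only('abc12ÄÄ34$$'))

# Two separate numbers are merged into a single 8-character string.
merged = digits_only('order 42 for company 123456')
print(merged, len(merged))
```

Note that str.isnumeric() also accepts characters such as '½', so checking that the result has exactly 6 characters before using it as a file name is a reasonable safeguard.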