Extract number from text using Python

I have extracted the text from an image using Tesseract. From that text I only want the company number, e.g. “123456”, which is an arbitrary 6-digit number. I want to save the file under this company number so that it can be identified more easily.

My question: if the text contains both bytes and Unicode, what is the easiest way to extract this 6-digit number?

CodePudding user response：

You can use regex:

import re

string = 'some random text with a 6-digit number 123456 somewhere'
res = re.findall(r'\b\d{6}\b', string)
print(res)

Output:

['123456']

Explanation:

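\d{6}: match exactly 6 digits
\b: word boundary on each side, so a 6-digit run inside a longer number is not matched, e.g. you don't get "123456" from "1234567". Letters and underscores also count as word characters, so "ID123456" is not matched either.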
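If the number can sit directly next to letters or underscores, or if the OCR text arrives as bytes, a variant with digit-only lookarounds may be more forgiving. This is a minimal sketch, not part of the original answer: the helper name find_company_number and the UTF-8 assumption are mine, and re.ASCII restricts \d to 0-9 (by default \d also matches other Unicode decimal digits in str patterns).

```python
import re

# (?<!\d) and (?!\d): no digit directly before or after the 6-digit run.
# re.ASCII limits \d to the characters 0-9.
SIX_DIGITS = re.compile(r'(?<!\d)\d{6}(?!\d)', re.ASCII)

def find_company_number(text):
    """Return the first standalone 6-digit run in text, or None (hypothetical helper)."""
    if isinstance(text, bytes):
        # Assumption: the bytes are UTF-8; undecodable bytes are replaced, not raised.
        text = text.decode('utf-8', errors='replace')
    match = SIX_DIGITS.search(text)
    return match.group() if match else None

print(find_company_number(b'Invoice No.123456 \xc3\x84'))  # 123456
print(find_company_number('ID123456_x'))                   # 123456
print(find_company_number('1234567'))                      # None
```

Returning None when nothing is found lets the caller decide on a fallback file name instead of failing on an empty list.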

CodePudding user response：

An example of the text you are trying to convert would help.

However, you could easily extract the subset of characters that match a given criterion. Note that this collects every digit in the text, not just one 6-digit number:

s = 'abc12ÄÄ34$$'  # sample string
digits = [ch for ch in s if ch.isnumeric()]  # list containing only the numeric characters
digits = ''.join(digits)  # join into a string if needed: '1234'
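Two caveats with character filtering are worth seeing in code. First, str.isnumeric() also accepts characters such as '½' and '²', which OCR output can contain. Second, all digits are concatenated, so any other number in the text ends up in the result. The sketch below illustrates both; the helper ascii_digits and the sample strings are hypothetical.

```python
def ascii_digits(text):
    """Keep only the characters 0-9 from text (hypothetical helper)."""
    return ''.join(ch for ch in text if ch in '0123456789')

s = 'abc12ÄÄ34$$ ½ ²'
print(''.join(ch for ch in s if ch.isnumeric()))  # 1234½²
print(ascii_digits(s))                            # 1234

# Every digit is kept, so a second number in the text is merged in as well.
print(ascii_digits('Tel 555 ref 123456'))         # 555123456
```

For isolating a single 6-digit number in a longer text, a regular expression is therefore usually the better fit.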