Extract number from text using Python

Time:12-31

I have extracted the text from an image using Tesseract and now have the full text. However, I only want to extract the company number, e.g. “123456”, which is an arbitrary 6-digit number. I want to use it as the file name when saving, so that each file can be identified by its company number more easily.

My question: if I have text containing both bytes and Unicode, what is the easiest way to extract this 6-digit number?

CodePudding user response:

You can use regex:

import re

string = 'some random text with a 6-digit number 123456 somewhere'
res = re.findall(r'\b\d{6}\b', string)
print(res)

Output:

['123456']

Explanation:

  • \d{6}: match exactly 6 digits
  • \b: word boundary; ensures the 6 digits are not part of a longer number, e.g. "123456" is not matched inside "1234567"
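Since the question mentions text that may be bytes or Unicode, here is a minimal sketch of how the same pattern could be applied to either. The helper name extract_company_number, the UTF-8 default encoding, and the sample OCR string are assumptions made for illustration, not part of the original answer. The re.ASCII flag restricts \d to 0-9, which is probably what you want for a file name.

```python
import re

def extract_company_number(text, encoding="utf-8"):
    """Return the first standalone 6-digit number in text, or None."""
    # Assumption: bytes input is UTF-8; undecodable bytes are replaced
    # instead of raising, so noisy OCR output does not abort the search.
    if isinstance(text, bytes):
        text = text.decode(encoding, errors="replace")
    # re.ASCII limits \d to 0-9 and \b to ASCII word boundaries.
    match = re.search(r"\b\d{6}\b", text, flags=re.ASCII)
    return match.group(0) if match else None

# Hypothetical OCR output containing a non-ASCII character and a 7-digit number.
ocr_output = b"Invoice \xc3\x84 company no. 123456, ref 1234567"
company_number = extract_company_number(ocr_output)
print(company_number)
```

re.search returns only the first match; if the text can contain several 6-digit numbers, re.findall as in the answer above lets you inspect all candidates before picking a file name.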

CodePudding user response:

An example of the text you are trying to convert would help.

However, you could easily extract the subset of characters that match a given criterion:

s = 'abc12ÄÄ34$$'  # sample string
digits = [ch for ch in s if ch.isnumeric()]  # list containing only the numeric characters
digits = ''.join(digits)  # join into a string if needed: '1234'
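Two caveats apply to character filtering, sketched below with a made-up sample string. First, str.isnumeric() also accepts characters such as '½' and '²', while str.isdecimal() only accepts decimal digits. Second, filtering joins every digit in the text into one string, so it only yields the company number if the text contains no other digits.

```python
# Hypothetical sample containing a fraction and a superscript digit.
s = "abc12ÄÄ34$$ ½ ²"

# isnumeric() keeps anything with a numeric value, including '½' and '²'.
numeric = "".join(ch for ch in s if ch.isnumeric())

# isdecimal() keeps only characters usable in base-10 numbers.
decimal = "".join(ch for ch in s if ch.isdecimal())

# Strictest option: keep only the ASCII digits 0-9.
ascii_digits = "".join(ch for ch in s if ch in "0123456789")

print(numeric)       # 1234½²
print(decimal)       # 1234
print(ascii_digits)  # 1234
```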