How to set up Tesseract OCR properly

I am using Tesseract OCR to convert a preprocessed license plate image into text, but it fails on some images that look perfectly clean. The Tesseract setup is shown in the function definition below. I am running this on Google Colab. The input image shows the plate ZG NIVEA 1. I am not sure whether I am doing something wrong or whether there is a better way to do this; the result I get from this particular image is A.

!sudo apt install -q tesseract-ocr
!pip install -q pytesseract
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
import cv2
import re

def pytesseract_image_to_string(img, oem=3, psm=7) -> str:
  '''
  oem - OCR Engine Mode
      0 = Original Tesseract only.
      1 = Neural nets LSTM only.
      2 = Tesseract + LSTM.
      3 = Default, based on what is available.
  psm - Page Segmentation Mode
      0 = Orientation and script detection (OSD) only.
      1 = Automatic page segmentation with OSD.
      2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
      3 = Fully automatic page segmentation, but no OSD. (Default)
      4 = Assume a single column of text of variable sizes.
      5 = Assume a single uniform block of vertically aligned text.
      6 = Assume a single uniform block of text.
      7 = Treat the image as a single text line.
      8 = Treat the image as a single word.
      9 = Treat the image as a single word in a circle.
      10 = Treat the image as a single character.
      11 = Sparse text. Find as much text as possible in no particular order.
      12 = Sparse text with OSD.
      13 = Raw line. Treat the image as a single text line,
          bypassing hacks that are Tesseract-specific.
  '''
  tess_string = pytesseract.image_to_string(img, config=f'--oem {oem} --psm {psm}')
  regex_result = re.findall(r'[A-Z0-9]', tess_string) # filter only uppercase alphanumeric symbols
  return ''.join(regex_result)

image = cv2.imread('nivea.png')
print(pytesseract_image_to_string(image))

Edit: The approach in the accepted answer works for the ZGNIVEA1 image, but not for others. Is there a general "font size" that Tesseract OCR works best with, or a rule of thumb?

CodePudding user response：

By applying a Gaussian blur before OCR, I ended up with the correct output. You may also be able to drop the regex by adding -c tessedit_char_whitelist=ABC.. to your config string.
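As a sketch of the whitelist idea, assuming plates contain only uppercase letters and digits: the helper names build_tesseract_config and ocr_plate below are hypothetical and not part of pytesseract. As far as I know, the LSTM engine in Tesseract 4.0 ignored tessedit_char_whitelist and support returned in 4.1, so it may be worth checking the installed version with tesseract --version.

```python
import string

# Assumption: a plate contains only uppercase letters and digits.
PLATE_CHARS = string.ascii_uppercase + string.digits


def build_tesseract_config(oem: int = 3, psm: int = 7,
                           whitelist: str = PLATE_CHARS) -> str:
    """Build a Tesseract config string (hypothetical helper)."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def ocr_plate(img, oem: int = 3, psm: int = 7) -> str:
    """Run OCR on an image and return the text with all whitespace removed."""
    import pytesseract  # imported lazily so the config helper works without it

    text = pytesseract.image_to_string(img, config=build_tesseract_config(oem, psm))
    # Drop spaces and the trailing newline that Tesseract usually emits.
    return "".join(text.split())
```

With the whitelist in the config, the regex filter from the question should no longer be needed, although stripping whitespace is still useful.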

The code that produces correct output for me:

import cv2
import pytesseract

image = cv2.imread("images/tesseract.png")

# Whitelist covers letters only; append 0123456789 if digits must be recognized.
config = '--oem 3 --psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ'

image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
image = cv2.GaussianBlur(image, (5, 5), 0)

string = pytesseract.image_to_string(image, config=config)

print(string)

Output:
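On the follow-up question about character size: a fixed 2x upscale will not suit every image. One option, sketched here, is to scale each image so that its capital letters reach a target pixel height. The default of 32 px is an assumption based on commonly cited rules of thumb of roughly 20 to 40 px, not a documented Tesseract constant, and both function names are hypothetical.

```python
def scale_factor_for_text_height(current_height_px: float,
                                 target_height_px: float = 32.0) -> float:
    """Return the resize factor that brings text to the target pixel height.

    target_height_px is an assumed rule-of-thumb value; tune it per dataset.
    """
    if current_height_px <= 0:
        raise ValueError("current_height_px must be positive")
    return target_height_px / current_height_px


def resize_for_ocr(image, current_height_px: float,
                   target_height_px: float = 32.0):
    """Resize an OpenCV image so its characters are about target_height_px tall."""
    import cv2  # imported lazily so the factor helper works without OpenCV

    factor = scale_factor_for_text_height(current_height_px, target_height_px)
    # Cubic interpolation for enlarging, area interpolation for shrinking.
    interpolation = cv2.INTER_CUBIC if factor > 1 else cv2.INTER_AREA
    return cv2.resize(image, None, fx=factor, fy=factor,
                      interpolation=interpolation)
```

The current character height has to be measured first, for example from the bounding box of the plate crop, so this only helps when that estimate is reasonably accurate.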