Tesseract very low detection quality

Time:04-18

I'm trying to read some data with Tesseract, but it is already struggling with a date and time, so I created a minimal test case.

code:

#include <string>
#include <sstream>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc.hpp>
#include <boost/algorithm/string/trim.hpp>
using namespace std;
using namespace cv;

int main(int argc, const char * argv[]) {

    string outText, imPath = argv[1];
    cv::Mat image_final = cv::imread(imPath, CV_8UC1);

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
    api->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
    cv::adaptiveThreshold(image_final,image_final,255,ADAPTIVE_THRESH_MEAN_C, cv::THRESH_BINARY,11,2);

    api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step);
    api->SetVariable("tessedit_char_whitelist", "0123456789- :");
    outText = string(api->GetUTF8Text());
    api->End();

    std::istringstream iss(outText);

    for (std::string line; std::getline(iss, line); ) {
        boost::algorithm::trim(line);
        if (!line.empty()) cout << line << endl;
    }

    cv::imwrite("out.png", image_final);

    return 0;
}

test image

output:

1122-03-08 18:10
2122-030 18:10

I even tried whitelisting these characters (which will not be the case in the final version), but I'm still getting very bad results.

CodePudding user response:

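It looks like the main issue is passing 3 instead of 1 as bytes_per_pixel in api->SetImage.

The image after cv::adaptiveThreshold has 1 color channel (1 byte per pixel), not 3. With bytes_per_pixel set to 3, the single-channel buffer is read as if every pixel took three bytes, so the image Tesseract sees no longer matches the one written to out.png.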

Replace api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step); with:

api->SetImage(image_final.data, image_final.cols, image_final.rows, 1, image_final.step);
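To make the effect of the wrong bytes_per_pixel easier to see, here is a small pure-Python sketch. It does not call Tesseract or OpenCV; the helpers `layout_is_consistent` and `read_row` are hypothetical names made up for this illustration, and the slicing only approximates how a consumer of a raw buffer would walk it.

```python
def layout_is_consistent(width: int, bytes_per_pixel: int, bytes_per_line: int) -> bool:
    """True if one row of `width` pixels fits inside the declared row stride."""
    return width * bytes_per_pixel <= bytes_per_line


def read_row(buffer: bytes, row: int, width: int,
             bytes_per_pixel: int, bytes_per_line: int) -> list:
    """Slice one row into pixel tuples, the way a raw-buffer consumer would."""
    start = row * bytes_per_line
    return [
        tuple(buffer[start + x * bytes_per_pixel: start + (x + 1) * bytes_per_pixel])
        for x in range(width)
    ]


# A 4x2 single-channel image: 1 byte per pixel, so the stride equals the width.
width, height = 4, 2
gray = bytes([0, 255, 0, 255,
              255, 0, 255, 0])
step = width

# Declaring 1 byte per pixel matches the buffer; declaring 3 does not.
print(layout_is_consistent(width, 1, step))
print(layout_is_consistent(width, 3, step))

# Row 0 read correctly, then read as if it were 3 bytes per pixel.
# In the second case pixels are built from neighbouring gray values and the
# row runs into the next one (Python slices just come back short; C++ would
# read past the row instead).
print(read_row(gray, 0, width, 1, step))
print(read_row(gray, 0, width, 3, step))
```

A check like `layout_is_consistent(image.cols, bytes_per_pixel, image.step)` before calling SetImage would have flagged the mismatch in the question's code.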

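Also replace cv::imread(imPath, CV_8UC1) with cv::imread(imPath, cv::IMREAD_GRAYSCALE). CV_8UC1 is a matrix type constant, not an imread flag; it only works here because both constants happen to have the value 0.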


You may also try replacing tesseract::PSM_AUTO_ONLY with tesseract::PSM_AUTO or tesseract::PSM_SINGLE_BLOCK.

According to the comment in the header file:

PSM_AUTO_ONLY = 2, ///< Automatic page segmentation, but no OSD, or OCR.

(Unless this is on purpose; I have never used the C++ interface.)
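For switching between the enum names above and the numeric --psm flag used by the command line and pytesseract, a small sketch may help. The `Psm` class and `build_config` helper are hypothetical names for this illustration; only the three modes mentioned here are mirrored, with the values from Tesseract's PageSegMode enum.

```python
from enum import IntEnum


class Psm(IntEnum):
    """Python mirror of three tesseract::PageSegMode values (illustration only)."""
    AUTO_ONLY = 2     # automatic page segmentation, but no OSD or OCR
    AUTO = 3          # fully automatic page segmentation, no OSD
    SINGLE_BLOCK = 6  # assume a single uniform block of text


def build_config(psm: Psm, whitelist: str = "") -> str:
    """Build a config string with the --psm flag and an optional whitelist."""
    parts = [f"--psm {int(psm)}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_config(Psm.AUTO, "0123456789-:"))
print(build_config(Psm.SINGLE_BLOCK))
```

The resulting string is meant to be passed as the config argument of pytesseract.image_to_string, so trying another mode only means changing the enum member.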


I tried to reproduce the problem using pytesseract and Python, but I get an error when setting PSM to 2.
I am probably also using a different version of Tesseract.

With PSM 3 the result is perfect, as it should be with the image from your post.

Python code:

import cv2
from pytesseract import pytesseract

# Tesseract path
pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

img = cv2.imread("out.png", cv2.IMREAD_GRAYSCALE)  # Read input image as grayscale

# The language is passed through the lang argument; config carries only command-line flags
text = pytesseract.image_to_string(img, lang="eng",
                                   config="--psm 3 "
                                          "-c tessedit_char_whitelist=' '0123456789-:")

print(text)

Output:
2022-03-08 18:19:15
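
Since the text being read is a date and time, a cheap sanity check on the OCR output can catch misreads like the two lines in the question. This is only a sketch using the standard library; `parse_timestamp`, the two accepted formats, and the 2000 to 2100 year range are assumptions made for this illustration, not part of Tesseract or pytesseract.

```python
from datetime import datetime
from typing import Optional

# Formats assumed from the outputs shown in this thread.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")


def parse_timestamp(line: str, min_year: int = 2000,
                    max_year: int = 2100) -> Optional[datetime]:
    """Return a datetime if `line` is a plausible timestamp, otherwise None."""
    for fmt in FORMATS:
        try:
            ts = datetime.strptime(line.strip(), fmt)
        except ValueError:
            continue  # does not match this format, try the next one
        # The string is well formed; reject it if the year is implausible.
        return ts if min_year <= ts.year <= max_year else None
    return None


for candidate in ("2022-03-08 18:19:15", "1122-03-08 18:10", "2122-030 18:10"):
    print(repr(candidate), "->", parse_timestamp(candidate))
```

Only the first candidate is accepted: the second is well formed but has an implausible year, and the third does not match either format.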
