Tesseract very low detection quality

I am trying to read some data with Tesseract, but it is already struggling with the date and time, so I created a minimal test case.

code:

#include <string>
#include <sstream>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc.hpp>
#include <boost/algorithm/string/trim.hpp>
using namespace std;
using namespace cv;

int main(int argc, const char * argv[]) {

    string outText, imPath = argv[1];
    cv::Mat image_final = cv::imread(imPath, CV_8UC1);

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
    api->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
    cv::adaptiveThreshold(image_final,image_final,255,ADAPTIVE_THRESH_MEAN_C, cv::THRESH_BINARY,11,2);

    api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step);
    api->SetVariable("tessedit_char_whitelist", "0123456789- :");
    outText = string(api->GetUTF8Text());
    api->End();

    std::istringstream iss(outText);

    for (std::string line; std::getline(iss, line); ) {
        boost::algorithm::trim(line);
        if (!line.empty()) cout << line << endl;
    }

    cv::imwrite("out.png", image_final);

    return 0;
}

output:

1122-03-08 18:10
2122-030 18:10

I even tried whitelisting these characters (which will not be the case in the final version), but I am still getting very bad results.

CodePudding user response：

It looks like the main issue is setting bytes_per_pixel to 3 instead of 1 in api->SetImage.

The image after cv::adaptiveThreshold has a single color channel (1 byte per pixel), not 3.
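To see why a wrong bytes_per_pixel garbles the text, here is a small sketch in plain C++ (no Tesseract or OpenCV needed). It is a simplified model, not Tesseract's actual code: the function firstByteOfEachPixel and the 4x2 sample buffer are made up for illustration, and the only assumption is that a reader locates pixel (x, y) at offset y * bytes_per_line + x * bytes_per_pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model (an assumption, not Tesseract's real implementation) of a
// reader that walks a raw buffer using the same numbers SetImage receives.
// Returns the first byte of every pixel as that reader would see it;
// a read past the end of the buffer is reported as -1.
std::vector<int> firstByteOfEachPixel(const std::vector<std::uint8_t>& buf,
                                      int width, int height,
                                      int bytesPerPixel, int bytesPerLine) {
    assert(bytesPerPixel == 1 || bytesPerPixel == 3);
    std::vector<int> seen;
    seen.reserve(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t offset =
                static_cast<std::size_t>(y) * bytesPerLine +
                static_cast<std::size_t>(x) * bytesPerPixel;
            seen.push_back(offset < buf.size() ? buf[offset] : -1);
        }
    }
    return seen;
}

// Hypothetical 4x2 single-channel image: top row 10..13, bottom row 20..23.
std::vector<std::uint8_t> sampleGray() {
    return {10, 11, 12, 13,
            20, 21, 22, 23};
}
```

Calling firstByteOfEachPixel(sampleGray(), 4, 2, 1, 4) returns the pixels in order: 10 11 12 13 20 21 22 23. With bytesPerPixel set to 3 the same buffer is read as 10 13 22 -1 20 23 -1 -1: the reader skips pixels, runs into the next row, and walks off the end of the buffer, which would plausibly explain the scrambled digits.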

Replace api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step); with:

api->SetImage(image_final.data, image_final.cols, image_final.rows, 1, image_final.step);

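Replace cv::imread(imPath, CV_8UC1) with cv::imread(imPath, cv::IMREAD_GRAYSCALE). CV_8UC1 is a matrix type constant, not an imread flag; it only works here because both constants happen to have the value 0.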

You may also try replacing tesseract::PSM_AUTO_ONLY with tesseract::PSM_AUTO or tesseract::PSM_SINGLE_BLOCK.

According to the comment in the header file:

PSM_AUTO_ONLY = 2, ///< Automatic page segmentation, but no OSD, or OCR.

(Unless this is on purpose. I have never used the C++ interface.)
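For reference, the numeric values accepted by the tesseract command line (--psm N) follow the numbering of the PageSegMode enum. The table below is a hand-copied stand-in so that the sketch compiles without the Tesseract headers; psmNameFromNumber is a hypothetical helper, and the numbering should be checked against the header of your installed version.

```cpp
#include <cassert>
#include <string>

// Names copied by hand from the PageSegMode enum, in numeric order.
// Assumption: the numbering matches your installed Tesseract version.
static const char* const kPsmNames[] = {
    "PSM_OSD_ONLY",               // 0
    "PSM_AUTO_OSD",               // 1
    "PSM_AUTO_ONLY",              // 2: segmentation only, no OSD or OCR
    "PSM_AUTO",                   // 3: fully automatic, no OSD
    "PSM_SINGLE_COLUMN",          // 4
    "PSM_SINGLE_BLOCK_VERT_TEXT", // 5
    "PSM_SINGLE_BLOCK",           // 6
    "PSM_SINGLE_LINE",            // 7
    "PSM_SINGLE_WORD",            // 8
    "PSM_CIRCLE_WORD",            // 9
    "PSM_SINGLE_CHAR",            // 10
    "PSM_SPARSE_TEXT",            // 11
    "PSM_SPARSE_TEXT_OSD",        // 12
    "PSM_RAW_LINE"                // 13
};
constexpr int kPsmCount = sizeof(kPsmNames) / sizeof(kPsmNames[0]);

// Hypothetical helper: translate a numeric --psm value into the enum name.
std::string psmNameFromNumber(int psm) {
    assert(psm >= 0 && psm < kPsmCount);
    return kPsmNames[psm];
}
```

So psmNameFromNumber(2) gives PSM_AUTO_ONLY, the mode used in the question, while 3 and 6 give the suggested PSM_AUTO and PSM_SINGLE_BLOCK.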

I tried to reproduce the problem using pytesseract and Python, but I get an error when setting PSM to 2.
I am probably also using a different version of Tesseract.

The result is perfect, as it is supposed to be with the image from your post.

Python code:

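import cv2
from pytesseract import pytesseract

# Path to the Tesseract executable (adjust for your installation)
pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

img = cv2.imread("out.png", cv2.IMREAD_GRAYSCALE)  # Read the input image as grayscale

# Whitelist space, digits, '-' and ':'; --psm 3 is fully automatic page segmentation.
# The language is passed through the lang argument rather than inside config.
text = pytesseract.image_to_string(
    img,
    lang="eng",
    config="--psm 3 -c tessedit_char_whitelist=' 0123456789-:'",
)

print(text)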

Output:
2022-03-08 18:19:15