I am trying to read some data with Tesseract, but it is already struggling with the date and time, so I created a minimal test case.
code:
#include <string>
#include <sstream>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc.hpp>
#include <boost/algorithm/string/trim.hpp>
using namespace std;
using namespace cv;
int main(int argc, const char * argv[]) {
string outText, imPath = argv[1];
cv::Mat image_final = cv::imread(imPath, CV_8UC1);
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
api->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
cv::adaptiveThreshold(image_final,image_final,255,ADAPTIVE_THRESH_MEAN_C, cv::THRESH_BINARY,11,2);
api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step);
api->SetVariable("tessedit_char_whitelist", "0123456789- :");
outText = string(api->GetUTF8Text());
api->End();
std::istringstream iss(outText);
for (std::string line; std::getline(iss, line); ) {
boost::algorithm::trim(line);
if (!line.empty()) cout << line << endl;
}
cv::imwrite("out.png", image_final);
return 0;
}
output:
1122-03-08 18:10
2122-030 18:10
I even tried whitelisting these characters (which will not be the case in the final version), but I am still getting very bad results.
CodePudding user response:
It looks like the main issue is setting bytes_per_pixel to 3 instead of 1 in api->SetImage.
The image after cv::adaptiveThreshold has 1 color channel (1 byte per pixel), not 3.
Replace api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step);
with:
api->SetImage(image_final.data, image_final.cols, image_final.rows, 1, image_final.step);
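To see why a wrong bytes_per_pixel garbles the text, here is a minimal, library-free sketch of how a raw buffer is addressed. It is not Tesseract's actual code: the helper first_byte_of_pixel is hypothetical and the 4x2 image is made up for illustration. The only assumption is the usual row-major layout, where pixel (x, y) starts at offset y * bytes_per_line + x * bytes_per_pixel.

```python
def first_byte_of_pixel(data, x, y, bytes_per_pixel, bytes_per_line):
    """Return the first byte of pixel (x, y) in a row-major raw buffer."""
    offset = y * bytes_per_line + x * bytes_per_pixel
    return data[offset]

# A made-up 4x2 grayscale image: 1 byte per pixel, 4 bytes per line.
width, height = 4, 2
gray = bytes([10, 20, 30, 40,
              50, 60, 70, 80])

# Correct interpretation: 1 byte per pixel.
row0_ok = [first_byte_of_pixel(gray, x, 0, 1, width) for x in range(width)]

# Wrong interpretation: same buffer, but declared as 3 bytes per pixel.
# Pixel x now starts at byte 3*x, so "row 0" skips bytes and runs into row 1.
# Only 3 pixels are read here; a 4th would start past the end of the buffer.
row0_bad = [first_byte_of_pixel(gray, x, 0, 3, width) for x in range(3)]

print(row0_ok)   # [10, 20, 30, 40]
print(row0_bad)  # [10, 40, 70]
```

With the wrong value, the reader sees a different picture than the one OpenCV produced, which would explain the distorted digits.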
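Replace cv::imread(imPath, CV_8UC1) with cv::imread(imPath, cv::IMREAD_GRAYSCALE).
CV_8UC1 is a matrix type constant, not an imread flag; it only works here because both constants happen to have the value 0.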
You may also try replacing tesseract::PSM_AUTO_ONLY with tesseract::PSM_AUTO or tesseract::PSM_SINGLE_BLOCK.
According to the comment in the header file:
PSM_AUTO_ONLY = 2, ///< Automatic page segmentation, but no OSD, or OCR.
(Unless this is on purpose; I have never used the C++ interface.)
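For reference, the three modes mentioned above correspond to the numbers 2, 3 and 6 of the --psm command-line option. The sketch below maps the names to those numbers and builds a config string from them. The helper build_config and the PSM dictionary are hypothetical names for illustration, and it assumes the string is later split shell-style (POSIX rules), as Python's shlex.split does by default.

```python
import shlex

# Numeric values of the page segmentation modes mentioned above,
# as used by the --psm command-line option.
PSM = {
    "PSM_AUTO_ONLY": 2,     # automatic segmentation, but no OSD or OCR
    "PSM_AUTO": 3,          # fully automatic segmentation, no OSD
    "PSM_SINGLE_BLOCK": 6,  # assume a single uniform block of text
}

def build_config(whitelist, psm_name):
    """Build a config string; shlex.quote protects the space in the whitelist."""
    return "-c tessedit_char_whitelist={} --psm {}".format(
        shlex.quote(whitelist), PSM[psm_name])

config = build_config("0123456789-: ", "PSM_SINGLE_BLOCK")
print(config)
# Splitting shell-style gives the whitelist back as a single argument.
print(shlex.split(config))
```

Quoting the whitelist keeps the space character from being treated as an argument separator.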
I tried to reproduce the problem using pytesseract and Python, but I get an error when setting PSM to 2.
I am probably also using a different version of Tesseract.
With PSM 3 the result is perfect, and it is supposed to be perfect with the image from your post.
Python code:
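import cv2
from pytesseract import pytesseract

# Path to the Tesseract executable (adjust for your installation)
pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

img = cv2.imread("out.png", cv2.IMREAD_GRAYSCALE)  # Read input image as grayscale

# Whitelist digits, space, '-' and ':'; --psm 3 is fully automatic page segmentation.
# The language is passed through the lang parameter rather than inside the config string.
text = pytesseract.image_to_string(img, lang="eng",
                                   config="-c tessedit_char_whitelist=' '0123456789-: --psm 3")
print(text)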
Output:
2022-03-08 18:19:15
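Since the expected text is a timestamp, malformed OCR lines can be filtered out before they are used. The following is only a sketch using the Python standard library; the function name parse_timestamp and the two accepted layouts are assumptions based on the outputs shown in this thread.

```python
from datetime import datetime

def parse_timestamp(line):
    """Return a datetime if the line matches a known layout, else None."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M"):
        try:
            return datetime.strptime(line.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_timestamp("2022-03-08 18:19:15"))  # 2022-03-08 18:19:15
print(parse_timestamp("2122-030 18:10"))       # None
```

Note that this only catches malformed output: a well-formed but wrong reading such as 1122-03-08 18:10 still parses as a valid date.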