I am new to Tess4J and to JNA, so apologies if this is obvious, but I have not been able to find an answer in the blogs. I am on Ubuntu 18.04, running Java 17.0.1 and Tomcat 10.0. I have built a simple dynamic web app, details below. I installed the native dependencies as follows:
sudo apt install tesseract-ocr tesseract-ocr-rus libleptonica-dev
First I will mention that I am able to handle my test doc with no problems from the command line:
tesseract /tmp/output-0.jpg /tmp/file -l rus+eng
But when I try the same from Java the JVM crashes.
The relevant Java inside my class OCR is as follows:
private static final String tessDir = "/usr/share/tesseract-ocr/4.00/";
private static final String libDir = "/usr/lib/x86_64-linux-gnu";
private ITesseract ocr = new Tesseract();

public OCR() {
    System.setProperty("java.library.path", System.getProperty("java.library.path") + ":" + libDir);
    ocr.setDatapath(tessDir);
}

public String doOcr(String inputDirName, String outputDirName, List<File> files, Set<Lang> langs) throws IOException {
    File f1 = new File("/tmp/output-0.jpg");
    String s = "";
    ocr.setLanguage("rus+eng");
    try {
        s = ocr.doOCR(f1);
    } catch (Exception e) {
        throw new RuntimeException(e.getMessage());
    }
    return s;
}
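For reference, Tesseract takes multiple languages as a single string joined with "+" (for example rus+eng). A minimal sketch of building that string from a list of codes follows; the class name LanguageString and the use of plain String codes are assumptions for illustration, since the Lang type used above is not shown.

```java
import java.util.List;

public class LanguageString {

    // Joins Tesseract language codes into the "lang1+lang2" form.
    // A List is used so that the order of the codes stays explicit.
    static String join(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        // Prints: rus+eng
        System.out.println(join(List.of("rus", "eng")));
    }
}
```

The result could then be passed to ocr.setLanguage(...) instead of a hard-coded literal.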
pom.xml:
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna-platform</artifactId>
    <version>5.6.0</version>
</dependency>
<dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-core</artifactId>
    <version>1.3.0</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.6.0</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.lept4j</groupId>
    <artifactId>lept4j</artifactId>
    <version>1.16.1</version>
</dependency>
<dependency>
    <groupId>org.ghost4j</groupId>
    <artifactId>ghost4j</artifactId>
    <version>1.0.1</version>
</dependency>
The crash log looks like this:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f67aeed2c27, pid=23274, tid=23912
#
# JRE version: Java(TM) SE Runtime Environment (17.0.1+12) (build 17.0.1+12-LTS-39)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.1+12-LTS-39, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
...
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libtesseract.so.4+0xa1c27] tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int)+0x437
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.sun.jna.Native.invokePointer(Lcom/sun/jna/Function;JI[Ljava/lang/Object;)J+0
j com.sun.jna.Function.invokePointer(I[Ljava/lang/Object;)Lcom/sun/jna/Pointer;+7
j com.sun.jna.Function.invoke([Ljava/lang/Object;Ljava/lang/Class;ZI)Ljava/lang/Object;+385
j com.sun.jna.Function.invoke(Ljava/lang/reflect/Method;[Ljava/lang/Class;Ljava/lang/Class;[Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+271
j com.sun.jna.Library$Handler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+390
j jdk.proxy3.$Proxy10.TessBaseAPIGetUTF8Text(Lnet/sourceforge/tess4j/ITessAPI$TessBaseAPI;)Lcom/sun/jna/Pointer;+16 jdk.proxy3
j net.sourceforge.tess4j.Tesseract.getOCRText(Ljava/lang/String;I)Ljava/lang/String;+269
j net.sourceforge.tess4j.Tesseract.doOCR(Ljavax/imageio/IIOImage;Ljava/lang/String;Ljava/awt/Rectangle;I)Ljava/lang/String;+18
j net.sourceforge.tess4j.Tesseract.doOCR(Ljava/io/File;Ljava/awt/Rectangle;)Ljava/lang/String;+126
j net.sourceforge.tess4j.Tesseract.doOCR(Ljava/io/File;)Ljava/lang/String;+3
j mypackage.OCR.doOcr(Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Ljava/util/Set;)Ljava/lang/String;+32
libDir does indeed contain libtesseract.so.4 -> libtesseract.so.4.0.0 and liblept.so -> liblept.so.5.0.2.
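One way to double-check which library files a directory exposes, and what each symlink resolves to, is to list them from Java. This is a minimal sketch using only java.nio; the class name NativeLibListing and the describe helper are hypothetical, and it only reports file names, not whether the versions are compatible with Tess4J.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NativeLibListing {

    // Returns "name -> resolved file name" for every entry in dir whose
    // file name starts with one of the given prefixes.
    // toRealPath() follows symlinks and throws IOException on a broken link.
    static List<String> describe(Path dir, String... prefixes) throws IOException {
        List<Path> sorted;
        try (Stream<Path> entries = Files.list(dir)) {
            sorted = entries.sorted().collect(Collectors.toList());
        }
        List<String> result = new ArrayList<>();
        for (Path p : sorted) {
            String name = p.getFileName().toString();
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    result.add(name + " -> " + p.toRealPath().getFileName());
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The default directory is the libDir value from the question.
        Path dir = Path.of(args.length > 0 ? args[0] : "/usr/lib/x86_64-linux-gnu");
        describe(dir, "libtesseract", "liblept").forEach(System.out::println);
    }
}
```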
So what am I missing? Version mismatch somewhere?
Answer:
You may not be aware of it, but the JavaCPP Presets for Tesseract (org.bytedeco) offer an API that you can use instead of pointing directly at the lib folder of your installation.
The tesseract-platform artifact ships the native libraries for several platforms, so the same setup should work on both Windows and Linux.
Example of Usage:
The pom.xml build file
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.bytedeco.tesseract</groupId>
    <artifactId>BasicExample</artifactId>
    <version>1.5.7-SNAPSHOT</version>
    <properties>
        <exec.mainClass>BasicExample</exec.mainClass>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.bytedeco</groupId>
            <artifactId>tesseract-platform</artifactId>
            <version>5.0.0-1.5.7-SNAPSHOT</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>.</sourceDirectory>
    </build>
</project>
The BasicExample.java source file
import org.bytedeco.javacpp.*;
import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.lept.*;
import static org.bytedeco.tesseract.global.tesseract.*;

public class BasicExample {
    public static void main(String[] args) {
        BytePointer outText;
        TessBaseAPI api = new TessBaseAPI();
        // Initialize tesseract-ocr with English, without specifying tessdata path
        if (api.Init(null, "eng") != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        // Open input image with leptonica library
        PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
        api.SetImage(image);
        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());
        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}
Project Documentation:
https://github.com/bytedeco/javacpp-presets/tree/master/tesseract
Relevant Stack Overflow question for V4: "Using Tesseract from java"