I am building a Docker image based on Alpine that depends on Tesseract for OCR. The Tesseract site lists two flavors of English: eng (modern English) and enm (Middle English). However, I am having trouble getting the eng version installed on Alpine.
My Dockerfile has the following:
FROM eclipse-temurin:17-jre-alpine as tesseract-master
RUN apk update && apk add tesseract-ocr
RUN apk update && apk add tesseract-ocr-data-eng
This fails to find the eng language package. During the build, the repository index is listed, and it clearly does not contain an eng package.
I am able to install the enm package, but I expect it to cause problems since it is meant for Middle English.
Has anyone had success installing the eng package on Alpine?
CodePudding user response:
If you look at the contents of one of those language packages, for example tesseract-ocr-data-enm, you will quickly see that it contains only one file:
- /usr/share/tessdata/enm.traineddata
Source: https://pkgs.alpinelinux.org/contents?name=tesseract-ocr-data-enm&branch=v3.17&arch=aarch64
Now, working backwards, you can search for the package that contains the file /usr/share/tessdata/eng.traineddata. Unsurprisingly, it is the default package: tesseract-ocr.
Source: https://pkgs.alpinelinux.org/contents?file=eng.traineddata&branch=v3.17&arch=aarch64
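The same reverse lookup can be done locally with apk, as a rough sketch. The helper name who_owns is made up for illustration. It assumes you run it inside an Alpine container where tesseract-ocr is already installed, since apk only knows the owners of installed files.

```shell
# Hypothetical helper: print the installed package that owns a given file.
# Uses apk's reverse lookup, so it is only meaningful on Alpine.
who_owns() {
    if command -v apk >/dev/null 2>&1; then
        apk info --who-owns "$1"
    else
        # Not on Alpine (or apk is not on PATH): report and fail.
        echo "apk not found (not an Alpine system?)" >&2
        return 1
    fi
}
```

For example, who_owns /usr/share/tessdata/eng.traineddata should name the package that ships the English model. To go the other way, apk info -L tesseract-ocr lists the files the package installs.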
So, your Dockerfile should simply be:
FROM eclipse-temurin:17-jre-alpine as tesseract-master
RUN apk add --no-cache \
tesseract-ocr
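If you want to confirm that the English model really ended up in the image, a small check along these lines may help. It is only a sketch: the function name has_traineddata is made up, and the default directory is the /usr/share/tessdata path taken from the package listing above.

```shell
# Hypothetical helper: succeed if <lang>.traineddata exists in the tessdata directory.
# $1 = language code (e.g. eng), $2 = optional directory (defaults to the Alpine path).
has_traineddata() {
    lang="$1"
    dir="${2:-/usr/share/tessdata}"
    [ -f "$dir/$lang.traineddata" ]
}
```

Inside the container you could run has_traineddata eng && echo "eng model present". Running tesseract --list-langs is another way to see which languages Tesseract itself can find.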