How to properly extract Japanese text from PDF files

I need to extract the text from the pdf files.

The problem is that some pages of the files are scanned images, so PyPDF and PDFMiner can't retrieve any text from them and the result comes back empty.

Could anyone please give me a hint on how to proceed?

CodePudding user response：

I don't think there's a quick solution for dealing with Unicode text, especially Japanese.

One approach we could take:

Iterate over the pages and determine whether each page is scanned or not. This can be done using PyMuPDF; take a look at this answer.
If the page is not scanned, we can extract its text as usual.
If the page is scanned, we can convert it into a .png image using pdf2image, then use pytesseract to extract the data. Here is some sample code showing how to read the data from an image.
You might need to do some extra processing in order to get the proper words.

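import cv2
import pytesseract
from pytesseract import Output

# OpenCV loads images as BGR; pytesseract expects RGB.
img = cv2.imread('invoice-sample.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# lang='jpn' selects the Japanese model, which must be installed for Tesseract.
d = pytesseract.image_to_data(img, lang='jpn', output_type=Output.DICT)
print(d.keys())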
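
As for the "extra processing", here is a rough sketch of one way to turn that dictionary back into lines of text. The function name words_to_lines and the min_conf and joiner parameters are made up for illustration. It assumes the dictionary has the usual text, conf, page_num, block_num, par_num and line_num keys. It also assumes that joining the pieces without spaces suits Japanese, which may not hold for mixed Japanese/English pages.

```python
def words_to_lines(data, min_conf=0.0, joiner=""):
    """Rebuild text lines from a pytesseract image_to_data dictionary.

    Hypothetical helper: min_conf drops low-confidence words, and joiner
    is what goes between words ("" for Japanese, " " for English).
    """
    lines = {}
    for i, text in enumerate(data["text"]):
        # Rows for blocks, paragraphs and lines carry conf -1 and empty text.
        # conf may be a str, int or float depending on the pytesseract version.
        if float(data["conf"][i]) < min_conf or not text.strip():
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines.setdefault(key, []).append(text)
    # Dictionaries keep insertion order, so lines come out in reading order.
    return [joiner.join(words) for words in lines.values()]


if __name__ == "__main__":
    # Hand-written sample in the same shape as the OCR output.
    sample = {
        "text": ["", "請求", "書", "", "合計", " "],
        "conf": ["-1", "96", 91.5, -1, 88, 95],
        "page_num": [1, 1, 1, 1, 1, 1],
        "block_num": [1, 1, 1, 1, 1, 1],
        "par_num": [1, 1, 1, 1, 1, 1],
        "line_num": [0, 1, 1, 0, 2, 2],
    }
    print(words_to_lines(sample))
```

With real OCR output you would call words_to_lines(d), and raising min_conf is a cheap way to drop noise from poor scans.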

For more about Tesseract, see this article.
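
Putting the steps together, the whole loop could look roughly like the sketch below. This is not a tested recipe. The looks_scanned heuristic (a page has at least one embedded image and almost no text layer) and its min_chars threshold are my own assumptions, and the function names are made up. It also assumes that PyMuPDF, pdf2image (which needs Poppler) and Tesseract with the Japanese language data are installed.

```python
def looks_scanned(text, image_count, min_chars=20):
    """Guess whether a page is a scan: it has an image but almost no text layer.

    The threshold is an arbitrary assumption; tune it for your files.
    """
    return image_count > 0 and len(text.strip()) < min_chars


def extract_pdf_text(pdf_path, lang="jpn", dpi=300):
    """Return one string per page, using OCR only on pages that look scanned."""
    # Third-party imports are local so looks_scanned can be used without them.
    import fitz  # PyMuPDF
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            text = page.get_text()
            if looks_scanned(text, len(page.get_images())):
                # pdf2image page numbers start at 1, so render just this page.
                image = convert_from_path(
                    pdf_path, dpi=dpi,
                    first_page=index + 1, last_page=index + 1,
                )[0]
                text = pytesseract.image_to_string(image, lang=lang)
            pages.append(text)
    return pages
```

Calling extract_pdf_text('your-file.pdf') should give back a list with one entry per page. Pages that already have a text layer never go through OCR, which keeps it faster and avoids adding OCR errors to text that was fine.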