I've been struggling with this issue for a while now and I just don't know what's going on. My code is as messy as amateur code tends to be, but it usually works (except when it doesn't).
The code below converts an ordinary PDF file into a searchable (OCR) one.
import cv2
import fitz  # PyMuPDF
import numpy as np
import pytesseract as pyt
from PIL import Image

def ToOCR(directory):
    pdf = fitz.open(directory)
    for i in pdf:
        CONVERT = 3  # zoom factor used when rendering the page for OCR
        # This was copied from somewhere else on Stack Overflow
        pix = i.get_pixmap(matrix=fitz.Matrix(CONVERT, CONVERT))
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        gauss = cv2.GaussianBlur(np.array(img), (3, 3), 0)
        data = pyt.image_to_data(gauss,
                                 output_type=pyt.Output.DICT,
                                 config='-c preserve_interword_spaces=1 --oem 1 --psm 1 -l spa',
                                 lang='spa')
        for m in range(len(data['text'])):  # You can see here is where I took over XD
            if len(data['text'][m]) > 0:
                llenght = 0
                fz = 1
                # Here I set the font size: grow it until the word is about as wide as its OCR box
                while llenght < 0.9 * data['width'][m] / CONVERT:
                    fz += 1
                    llenght = fitz.get_text_length(data['text'][m], fontname="Times-Roman", fontsize=fz)
                # OCR coordinates are in rendered pixels, so divide by CONVERT to get page units
                i.insert_text((int(data['left'][m] / CONVERT), int((data['top'][m] + data['height'][m]) / CONVERT)),
                              data['text'][m],
                              fontname="Times-Roman",
                              fontsize=fz,
                              color=None,
                              fill=None,
                              render_mode=0,
                              border_width=1,
                              rotate=0,
                              morph=None,
                              stroke_opacity=0,
                              fill_opacity=0,
                              overlay=True,
                              oc=0)
    dest_dir = directory[:-3]  # strip the "pdf" extension, keeping the trailing dot
    pdf.save(dest_dir + 'ocr.pdf')
    pdf.close()
Sometimes (I can't tell exactly when) the text layer is not inserted at the right place on the page, nor at the right size.
When this happens, however, it is consistent: the text layer always ends up in the bottom-left corner of the PDF page, in a smaller font. The text itself is properly extracted and laid out, as if it had been extracted from a smaller version of the page pasted into that corner.
I decided to ask this question today because the problem appeared with a document from a scanner whose output usually works with my code.
Yesterday, I manually selected a higher quality and set the scanner to black-and-white mode. This is, unfortunately, the only relevant information I can provide, as I am not an expert in any of these subjects.
I would appreciate any suggestions.
CodePudding user response:
I realised there was no problem with the text detection and positioning.
Apparently (as mentioned here), "due to inconsistencies in how the PDF was created, it is possible the origin of that particular document is not the standard global origin on top-left."
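To make that quote a bit more concrete, here is a small illustration in plain Python. It rests on an assumption: that the scanner's page content leaves a scaling transformation active, so anything appended afterwards is drawn in that scaled coordinate system. I have not inspected the actual file, and the 0.25 factor is made up for the example.

```python
def apply_matrix(matrix, point):
    """Apply a PDF-style affine matrix (a, b, c, d, e, f) to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

IDENTITY = (1, 0, 0, 1, 0, 0)              # what the insertion code expects
LEFTOVER_SCALE = (0.25, 0, 0, 0.25, 0, 0)  # hypothetical leftover scaling, for illustration only

word_position = (400, 600)  # where the OCR says a word should go

expected = apply_matrix(IDENTITY, word_position)
actual = apply_matrix(LEFTOVER_SCALE, word_position)
print("expected:", expected)
print("actual:  ", actual)
```

Under that assumption every coordinate (and the font size) shrinks by the same factor towards the PDF origin, which in native PDF coordinates is the bottom-left corner. That would match the symptom of a correctly laid out but miniature text layer in that corner.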
According to the same post, the solution turned out to be as simple as adding:
if not(i._isWrapped):
    i.wrap_contents()
It might be useful to note that the original post uses i._wrapContents(), which is either a mistake or an older name that has since been deprecated, just like i.wrapContents(), which raises a deprecation warning itself.
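Since the method and attribute names have changed between PyMuPDF versions, the check could be put in a small helper that probes for whichever name is available. This is only a sketch: ensure_wrapped is a name I made up, and I am assuming that newer releases expose the flag as is_wrapped while the version I used exposes it as _isWrapped.

```python
def ensure_wrapped(page):
    """Call page.wrap_contents() once if the page is not wrapped yet.

    `page` is expected to be a PyMuPDF page object. The flag is looked up
    with getattr because its name is assumed to differ between versions:
    `is_wrapped` (newer) or `_isWrapped` (the one used in this answer).
    Returns True if wrap_contents() was called, False otherwise.
    """
    wrapped = getattr(page, "is_wrapped", None)
    if wrapped is None:
        # Fall back to the older private attribute; treat "missing" as not wrapped.
        wrapped = getattr(page, "_isWrapped", False)
    if not wrapped:
        page.wrap_contents()
        return True
    return False
```

In the question's code, this would be called as ensure_wrapped(i) at the top of the `for i in pdf:` loop, before any insert_text call on that page.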