How to make invisible (e.g. OCR) text visible after removing text-images from PDF with Ghostscript

Time:09-28

I used gs -o 'out.pdf' -sDEVICE=pdfwrite -dFILTERIMAGE 'in.pdf' to remove all images from some PDF files to minimize their file sizes. Now in some of those PDFs, the result is invisible text, as they only consisted of scanned pages with an invisible OCR layer on top. Is there some way to make that OCR text visible?

CodePudding user response:

The answer depends heavily on how the OCR was done. My test sample was an exceptionally clean result from AWS Textract (reality is less perfect, since it depends on each image).

Several things to note: the invisible text is often not aligned with the real letter positions, because characters are placed as averaged word blocks or lines. It tends to sit lower than the letters in the scan, and in the worst cases it sits as low as an underline. The line width is often set to 1 point, with no stroke and no fill.

When you strip the image, nothing shows.

At this juncture you have a few choices, but generally you need to blacken what's left. cpdf can do that well in some cases; however, I had no success using:

cpdf -blacktext -color black -opacity 1.0 in.pdf -o out.pdf

I had hoped it would do this, but alas, not today. In fact, every command-line tool I tried had problems with the "invisible text", although pdftotext sees it clearly, so the text could be extracted and reprinted as a PDF.
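A minimal sketch of the extraction half of that idea, assuming Poppler's pdftotext is installed and on the PATH. The function names pdftotext_command and extract_text are hypothetical helpers, not part of any tool mentioned above, and turning the extracted text back into a PDF is left to whatever printing tool you prefer.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call (hypothetical helper name).

    -layout asks pdftotext to keep the approximate physical layout
    of the page instead of reflowing the text.
    """
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text.

    Assumes the pdftotext binary is available; raises
    CalledProcessError if the tool reports a failure.
    """
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling extract_text("in.pdf", "in.txt") would write the OCR layer to in.txt and return it as a string. This keeps the words but not the original fonts or exact positions.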

The best I could do was use a GUI editor to recolor the text, so Inkscape or a similar programmable graphics application, or an API such as Acrobat or iText, will most likely be needed to change the text's appearance.
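For the programmatic route, here is a rough sketch of what "changing the text appearance" can mean at the content-stream level. It assumes the OCR layer is hidden with text rendering mode 3 (the "3 Tr" operator, meaning neither fill nor stroke), which is a common way to write invisible OCR text but is not confirmed for every OCR tool. The function name make_text_visible is hypothetical, and it works only on an already decompressed content stream; getting that stream out of the PDF and back in (with a PDF library, for example) is not shown.

```python
import re

# Matches either a PDF literal string "( ... )" (so it can be skipped)
# or the operator "3 Tr" (invisible text rendering mode).
# Assumption: strings contain no unescaped nested parentheses.
_PATTERN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"        # literal string, left untouched
    rb"|(?<![\w.+-])3(\s+)Tr\b",    # standalone operand 3 followed by Tr
    re.DOTALL,
)


def _swap(match):
    if match.group(1) is None:
        # A literal string was matched: return it unchanged.
        return match.group(0)
    # Switch rendering mode 3 (invisible) to 0 (fill).
    return b"0" + match.group(1) + b"Tr"


def make_text_visible(content_stream: bytes) -> bytes:
    """Rewrite '3 Tr' to '0 Tr' in a decoded PDF content stream."""
    return _PATTERN.sub(_swap, content_stream)
```

With mode 0 the text is filled with the current fill color, which is black unless the stream sets something else. Because the text was positioned to match the scan rather than to look good, expect the misalignment described above to become visible as well.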

CodePudding user response:

You can do:

cpdf -remove-all-text in.pdf -o out.pdf