Home > OS >  Extract text from Linearized PDF by pdfbox
Extract text from Linearized PDF by pdfbox

Time:09-07

I am using org.apache.pdfbox.text.PDFTextStripper version 2.0.26. It works good for most PDFs. But It cannot extract text correctly from Linearized PDF: enter image description here enter image description here

Here we can see the infty isolated svg formula from the pdf and its OCR extracted characters Yj= γEj 1/θ(Ej)θ−1.... which is meaningless for this type of reversal. A copy of math tables or formulas as images is usually the best possible solution otherwise the result is highly likely to be corrupted. Note how some braces are recognized but not some critical ones.

enter image description here enter image description here enter image description here

We can see why that will happen by looking at the outline of that area, looks like CMEX10 is one of the worst defined as text. θ γLj 1/θ(Lj)θ−1 enter image description here

  • Related