I am trying to use borb to extract text from PDFs. Some PDFs work well, but for others I get extra spaces between all letters and around the existing spaces. It looks like:
I N B E T A L N I N G / G I R E R I N G A V
If I count the spaces and notice that there are more than usual, can I use a regex in some way to remove one space everywhere?
So that it looks like:
INBETALNING / GIRERING AV
CodePudding user response:
Disclaimer: I am the author of borb
A PDF document doesn't really contain text as such. It contains rendering instructions that a program like Adobe Reader will execute. These instructions yield something a human might interpret as text.
For instance:
- go to position 30, 50
- use font Helvetica
- set color to black
- render the characters "Hello"
- move to 36, 50
- render the characters "World"
You will notice that the space in "Hello World" is not explicitly in the rendering instructions. It could be, but it doesn't need to be. Many PDF creation tools choose not to insert a space, and instead move the drawing cursor along.
Now what that means for text extraction is that software such as borb has to guess when to insert a space.
It can tell how far apart the bounding boxes of two characters are.
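To make the guessing concrete, here is a minimal sketch of gap-based space insertion. It is not borb's actual code: the `Chunk` tuples, the coordinates, and the "half a space width" threshold are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# (text, x_start, x_end): hypothetical horizontal bounding boxes on one line
Chunk = Tuple[str, float, float]


def join_chunks(chunks: List[Chunk], space_width: float) -> str:
    """Join chunks left to right, adding a space where the gap is wide enough."""
    parts: List[str] = []
    prev_end: Optional[float] = None
    for text, x0, x1 in sorted(chunks, key=lambda c: c[1]):
        # Assumed rule: a gap of at least half a space width counts as a space.
        if prev_end is not None and x0 - prev_end >= 0.5 * space_width:
            parts.append(" ")
        parts.append(text)
        prev_end = x1
    return "".join(parts)


words = [("Hello", 30.0, 55.0), ("World", 58.0, 85.0)]
print(join_chunks(words, space_width=3.0))    # Hello World
print(join_chunks(words, space_width=10.0))   # HelloWorld

# Letters drawn one by one, with a 1-unit gap between them.
letters = [("I", 0.0, 4.0), ("N", 5.0, 9.0), ("B", 10.0, 14.0)]
print(join_chunks(letters, space_width=1.0))  # I N B
print(join_chunks(letters, space_width=4.0))  # INB
```

The last two calls show the symptom from the question: when the assumed space width is too small, the normal gap between letters is mistaken for a space.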
Of course, if the space character is not used in the rendering instructions, it might not be included in the font information. This is called font subsetting: a specialised font is created that contains only the characters actually in use.
When this happens, borb doesn't know how wide a space character is supposed to be.
borb will try different heuristics:
- checking if the font is monospaced
- checking if enough other characters are defined (e.g. "a space is twice as wide as the character 'i'")
- falling back to a default
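A fallback chain of this kind could be sketched as follows. This is a simplified illustration, not the logic as it appears in borb: the function name `estimate_space_width`, the `widths` dictionary, and the default of 250 units are assumptions.

```python
from typing import Dict


def estimate_space_width(widths: Dict[str, float], default: float = 250.0) -> float:
    """Estimate the width of a space from the glyph widths a font defines."""
    # The font still contains a space glyph: no guessing needed.
    if " " in widths:
        return widths[" "]
    # Monospaced font: every glyph has the same width, so a space does too.
    distinct = set(widths.values())
    if len(widths) > 1 and len(distinct) == 1:
        return distinct.pop()
    # Derive the width from a reference glyph (assumed ratio: twice an "i").
    if "i" in widths:
        return 2 * widths["i"]
    # Nothing usable: fall back to an arbitrary default.
    return default


print(estimate_space_width({"a": 600.0, "b": 600.0}))  # 600.0
print(estimate_space_width({"i": 222.0, "a": 556.0}))  # 444.0
print(estimate_space_width({"a": 556.0, "W": 944.0}))  # 250.0
```

When a subsetted font leaves only the last branch available, the result is a guess, which is how output like the one in the question can arise.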
If you look in the code of SimpleTextExtraction you will be able to see this logic in action.
I suggest you subclass that class, and modify it to allow you (the user) to define an acceptable space character width.
In particular have a look at this line.
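If you would rather post-process the extracted text, the regex idea from the question can work as a stopgap. The sketch below assumes that real word gaps come out as runs of two or more spaces while the spurious gaps are single spaces. The function names and the 0.8 threshold are made up for illustration and are not part of borb.

```python
import re


def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Return True when most whitespace-separated tokens are single characters."""
    tokens = text.split()
    if not tokens:
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= threshold


def collapse_letter_spacing(text: str) -> str:
    """Drop single spaces and shrink longer runs of spaces to one space."""
    return re.sub(r" +", lambda m: " " if len(m.group()) > 1 else "", text)


def clean(text: str) -> str:
    """Only rewrite text that actually looks letter-spaced."""
    return collapse_letter_spacing(text) if looks_letter_spaced(text) else text


raw = "I N B E T A L N I N G   /   G I R E R I N G   A V"
print(clean(raw))            # INBETALNING / GIRERING AV
print(clean("Hello World"))  # Hello World
```

If the extracted text has exactly one space everywhere, including between words, the word boundaries are already lost and no regex can restore them. In that case adjusting the space width during extraction is the only reliable fix.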