How to get information about character spacing and word spacing from a PDF file?

I am using PyMuPDF and ran into a problem getting information about the text in a PDF file. I asked in the library's Discord channel whether it is possible to obtain information about the intervals (character and word spacing), but was told that the library cannot work with them. Perhaps there are other libraries that can do this?

I tried looking at other libraries but did not find anything. Maybe I missed something.

CodePudding user response：

Disclaimer: I am the author of borb, the library used in this answer.

Usually, the information you're looking for is hidden behind layers of abstraction. A PDF library might typically allow you to extract text (and it uses information about word and character spacing to do so), but it does not make this information available to the outside world.

You can use borb to get access to this (low level) information. The key concept here is EventListener. This is an interface. Classes implementing this interface get notified whenever a rendering event has finished.

Rendering events may include:

text being rendered
images being rendered
switching to a new page and so on

There is a class that extracts text, so I would recommend you check out its code. Looking at line 62, we see that any "render a piece of text" event gets redirected to its own separate method.

The method _render_text stores the TextRenderInfo objects until a page has finished rendering (at which point it will use the TextRenderInfo objects to determine the text that was on the page).

You can see the "end of page" logic in action on line 87.
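To illustrate the pattern described above, here is a minimal, library-independent sketch of a listener that buffers text events and processes them once the page ends. This is not borb's actual code or API: the names TextEvent, EndPageEvent, CollectingListener and event_occurred are hypothetical stand-ins, and only the method name _render_text is borrowed from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextEvent:
    """Hypothetical stand-in for a 'text was rendered' event."""
    text: str
    x: float  # left edge of the chunk on its baseline
    y: float  # height of the baseline (PDF y axis points up)


@dataclass
class EndPageEvent:
    """Hypothetical stand-in for a 'page finished rendering' event."""
    page_nr: int


@dataclass
class CollectingListener:
    """Buffers text events and orders them when the page ends."""
    pending: List[TextEvent] = field(default_factory=list)
    pages: Dict[int, List[TextEvent]] = field(default_factory=dict)

    def event_occurred(self, event) -> None:
        # Dispatch each kind of event to its own method.
        if isinstance(event, TextEvent):
            self._render_text(event)
        elif isinstance(event, EndPageEvent):
            self._end_page(event.page_nr)

    def _render_text(self, event: TextEvent) -> None:
        # Only store the event; nothing is interpreted yet.
        self.pending.append(event)

    def _end_page(self, page_nr: int) -> None:
        # Reading order: top to bottom (higher y first), then left to right.
        self.pages[page_nr] = sorted(self.pending, key=lambda e: (-e.y, e.x))
        self.pending = []
```

Because the positions are kept on the stored events, a subclass of such a listener could compute distances between neighbouring chunks in _end_page instead of only ordering them.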

Here you see that TextRenderInfo has all kinds of attributes related to position. For example, you can use get_baseline to access the baseline of the text.

CodePudding user response：

I solved my problem with pdfminer.six and PyMuPDF by getting the line and character positions. Thanks, all of you.
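For anyone landing here later, a minimal sketch of that idea, assuming horizontal left-to-right text. It reads per-character boxes from PyMuPDF's "rawdict" text output and measures the horizontal gaps between neighbouring glyphs. The helper names chars_per_line and spacing are my own, and the gaps are distances between glyph boxes, which is only an approximation of the character and word spacing values stored in the PDF itself.

```python
from typing import Dict, List, Tuple

Char = Tuple[str, float, float]  # (character, x0, x1) within one line


def chars_per_line(pdf_path: str) -> List[List[Char]]:
    """Collect per-character boxes for every text line in the PDF."""
    import fitz  # PyMuPDF; imported here so spacing() works without it

    lines: List[List[Char]] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("rawdict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    lines.append([
                        (ch["c"], ch["bbox"][0], ch["bbox"][2])
                        for span in line["spans"]
                        for ch in span["chars"]
                    ])
    return lines


def spacing(chars: List[Char]) -> Dict[str, List[float]]:
    """Split the gaps of one line into character gaps and word gaps."""
    gaps: Dict[str, List[float]] = {"char": [], "word": []}
    prev_x1 = None
    saw_space = False
    for c, x0, x1 in chars:
        if c.isspace():
            saw_space = True  # the next gap spans a word boundary
            continue
        if prev_x1 is not None:
            kind = "word" if saw_space else "char"
            gaps[kind].append(round(x0 - prev_x1, 2))
        prev_x1, saw_space = x1, False
    return gaps
```

Usage would look like `for line in chars_per_line("file.pdf"): print(spacing(line))`, where "file.pdf" is a placeholder path. Some PDFs contain no explicit space characters between words; in that case every gap ends up under "char" and a width threshold would be needed to separate words.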