Extract text in natural reading order using PyMuPDF (fitz)

I am trying to extract text using PyMuPDF (fitz) by following this tutorial: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

instead of blocks = page.getText("dict")["blocks"]

I wrote blocks = page.get_text("dict", sort=True)["blocks"]

according to https://pymupdf.readthedocs.io/en/latest/recipes-text.html

But the text is still not in the order I expect: the first paragraph appears somewhere in the middle of the output.

This happens when a page has more than one column of text.

CodePudding user response：

You made a good first step by using the sort argument. But please note that a PDF can position every single character independently, so any simple sorting approach can be defeated by a suitably constructed PDF.

If a page contains n text characters, then there exist n! different ways to encode the page - all of them looking identical, but only one of them extracting the "natural" reading sequence right away.

If your page contains tables, or if the text is organized in multiple columns (as is customary in newspapers), then you must invest additional logic to cope with that.

If you run PyMuPDF as a command-line module, you can extract text in a layout-preserving manner: python -m fitz gettext -mode layout ....

If you need to achieve a similar effect within your script, you may be forced to use text extraction detailed down to each single character: page.get_text("rawdict") and use the returned character positions to bring them in the right sequence.
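As a rough illustration of that idea, here is a minimal sketch that rebuilds text lines from character positions. It assumes the structure that page.get_text("rawdict") returns (blocks, then lines, then spans, then chars, each char carrying "c" and "origin"). The helper names iter_chars and chars_to_lines and the y_tol tolerance are made up for this example. It only handles a single column and makes no attempt to detect tables or columns.

```python
def iter_chars(rawdict):
    """Yield (x, y, char) for every character of a 'rawdict' page dictionary."""
    for block in rawdict["blocks"]:
        if block.get("type", 0) != 0:  # skip image blocks, keep text blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    x, y = ch["origin"]  # baseline start point of the character
                    yield x, y, ch["c"]


def chars_to_lines(rawdict, y_tol=2.0):
    """Group characters into lines by baseline, then order each line left to right.

    y_tol is a hypothetical tolerance (in points) for treating two baselines
    as belonging to the same line.
    """
    chars = sorted(iter_chars(rawdict), key=lambda t: (t[1], t[0]))
    lines = []  # each entry: [baseline_y_of_first_char, [(x, char), ...]]
    for x, y, c in chars:
        if lines and abs(y - lines[-1][0]) <= y_tol:
            lines[-1][1].append((x, c))
        else:
            lines.append([y, [(x, c)]])
    return [
        "".join(c for _, c in sorted(items, key=lambda t: t[0]))
        for _, items in lines
    ]
```

With a real page this would be called as chars_to_lines(page.get_text("rawdict")). For multi-column pages the characters would first have to be split by column, otherwise lines from neighbouring columns get merged.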

BTW the sort parameter causes the text blocks to be sorted ascending by (1) the vertical and (2) the horizontal coordinate of their bounding boxes. The y-coordinate grows downward from the top of the page, so if a block in the second column sits slightly higher on the page than the corresponding block in the first column, it will come first and the two columns get interleaved. To handle such a case you must use this knowledge to write specialized code.
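The interleaving can be reproduced without any PDF at all. The following sketch imitates the ordering described above on hand-made bounding boxes. The sort key (top edge, then left edge) is an assumption chosen for the demonstration and is not claimed to be the exact key PyMuPDF uses internally. The "name" field and all coordinates are invented.

```python
# Bounding boxes are (x0, y0, x1, y1), with y growing downward from the page top.
blocks = [
    {"name": "L1", "bbox": (50, 100, 290, 180)},   # left column, 1st paragraph
    {"name": "L2", "bbox": (50, 200, 290, 280)},   # left column, 2nd paragraph
    {"name": "R1", "bbox": (310, 98, 550, 180)},   # right column, starts 2pt higher
    {"name": "R2", "bbox": (310, 198, 550, 280)},  # right column, 2nd paragraph
]


def sort_vertical_then_horizontal(blocks):
    """Sort blocks by vertical, then horizontal position (assumed key: y0, x0)."""
    return sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))


sorted_names = [b["name"] for b in sort_vertical_then_horizontal(blocks)]
print(sorted_names)  # the right-column blocks come first at each height
```

A reader would want L1, L2, R1, R2, but a purely positional sort has no notion of columns, which is why the column boundaries have to be supplied separately.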

Assuming you know that you have a 2-column page, the following code snippet might be used:

# 'page' is assumed to be an already loaded fitz.Page
width2 = page.rect.width / 2  # half of the page width
left = page.rect + (0, 0, -width2, 0)  # the left half page
right = page.rect + (width2, 0, 0, 0)  # the right half page
# now extract the 2 halves separately:
lblocks = page.get_text("dict", clip=left, sort=True)["blocks"]
rblocks = page.get_text("dict", clip=right, sort=True)["blocks"]
blocks = lblocks + rblocks  # left column first, then right column
# now process 'blocks'
...
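If the page has already been extracted once, a similar effect can be approximated without clipping, by assigning each block to a column based on the horizontal center of its bounding box. This is only a sketch under the assumption of equally wide columns. The function name order_by_columns and its parameters are hypothetical, and it expects block dictionaries with a "bbox" entry like those in page.get_text("dict")["blocks"].

```python
def order_by_columns(blocks, page_width, n_columns=2):
    """Return blocks ordered column by column, top to bottom within a column.

    Assumes n_columns equally wide columns spanning page_width; each block
    is assigned to the column that contains its horizontal center.
    """
    col_width = page_width / n_columns

    def sort_key(block):
        x0, y0, x1, _ = block["bbox"]
        center_x = (x0 + x1) / 2
        # clamp so that blocks slightly outside the page still get a valid column
        col = max(0, min(int(center_x // col_width), n_columns - 1))
        return (col, y0, x0)

    return sorted(blocks, key=sort_key)
```

Usage would look like blocks = order_by_columns(page.get_text("dict")["blocks"], page.rect.width). Elements that span the full width, such as headings, have their center in the middle of the page and may land in the wrong column, so they need separate handling.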