extracting images from PDF with page and screen coordinate information-CodePudding

I want to extract images from PDFs retaining a knowledge of their content (page_number and coordinates on page). (Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp). I can do this with PDFBox (Java) but I'd ideally like a Python tool

My current (arbitrary) designs is to create filenames of the form:

image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png

Currently pdfplumber exposes cooordinates but with a PDFStream and encoding information rather than an image. Code to convert the stream to a *.png would solve the problem.

(NOTE: the pdfplumber approach of rendering to the screen and capturing the known rectangle (which I use) is not a solution as the image is often degraded and frequently overwritten with text.)

(NOTE: I have had problems with several Python tools (pdfminer.six, PuMuPDF) extracting images as they make the background black which obscures black text, etc. PDFBox (Java) doesn't have this problem.)

CodePudding user response：

I don't have a solution in Python but here is a small script using Ruby and

What you may note is that the objects that were inserted most likely as one PNG are broken in two when injected into the PDF and their scaled placement is defined. Note that a raw PNG (whatever its source common resolution was) will retain number of dots but its scale when inserted into the PDF could be totally different horizontal and vertical, thus the only meaningful data is W x H in pixel values.

It is not trivial to overlay the mask on the RGB component when simply extracted but can allow for colour changes if desired.

So PDFbox is one of the simpler/better tools for blending to a suitable output, (as you have discovered) but for Python it is generally the top end library products that can identify the placement of the two images and combine into a suitable alpha output like a new PNG.

For many suggestions see Extract images from PDF without resampling, in python?.

Your related part question was knowing where those components are placed on each page since one image (and its alpha mask) could be placed multiple times such as a heading logo on each page. Again it is easy in a single command line to see which pages are referenced by a group of images, but to see which image is placed where requires analyzing each pages resources, again requiring a library interrogation of page contents, thus best done via power house libraries such as iText or any other like PDFtron for python.

For a related command in PyMuPDF see https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_rects