I need to extract images from a specific section of a PDF page.
For example, consider a PDF page with a couple of images at the top and a couple of images at the bottom. I want to extract only the images at the top of the page.
What I have tried so far:
- Cropped the PDF with Ghostscript:
gs -o$croppedPdfFilepath -sDEVICE=pdfwrite -c "[/CropBox [31.46 690.22 560.54 839]" -c "/PAGES pdfmark" -sPageList=12 -f $originalPdfFilepath
- Then passed the cropped PDF to pdfimages to extract the images:
pdfimages -j "$croppedPdfFilepath" $outputDirectory/image
The problem is that pdfimages extracts all the images on that page (from both the top and the bottom), even though the cropped PDF, when viewed, only shows the images at the top of the page.
After some research, it appears that the CropBox only hides the cropped content from view; the content is still present in the PDF source.
Any guidance on removing the content from the PDF page, or any other approach, would be helpful. I'm using PHP to do this programmatically.
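Currently I drive these commands from PHP with shell_exec(), roughly like this (the variable names are placeholders for the real paths):

$originalPdfFilepath = '/path/to/original.pdf';
$croppedPdfFilepath  = '/path/to/cropped.pdf';
$outputDirectory     = '/path/to/output';

// Step 1: write a new CropBox for page 12 with Ghostscript.
shell_exec(
    'gs -o ' . escapeshellarg($croppedPdfFilepath)
    . ' -sDEVICE=pdfwrite'
    . ' -c "[/CropBox [31.46 690.22 560.54 839]" -c "/PAGES pdfmark"'
    . ' -sPageList=12 -f ' . escapeshellarg($originalPdfFilepath)
);

// Step 2: extract the images from the cropped PDF.
shell_exec(
    'pdfimages -j ' . escapeshellarg($croppedPdfFilepath)
    . ' ' . escapeshellarg($outputDirectory . '/image')
);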
CodePudding user response:
If you need to extract images based on their position on the page, you can do it fairly easily with pdftohtml by parsing its XML output and checking the position attributes of each element. Here's a very basic example that collects the full paths of images positioned less than 200 units from the top of the page:
$pdf = '/path/to/test.pdf';
$files = [];

// Run pdftohtml in XML mode, capturing the XML on stdout; the image
// files themselves are written to disk.
$xml = shell_exec('pdftohtml -stdout -xml ' . escapeshellarg($pdf));

$dom = new DOMDocument();
$dom->loadXML($xml);

// Each extracted image appears as an <image> element whose top/left/
// width/height attributes give its position on the page.
$images = $dom->getElementsByTagName('image');
foreach ($images as $image) {
    $top = (int) $image->getAttribute('top');
    if ($top < 200) {
        $files[] = dirname($pdf) . '/' . $image->getAttribute('src');
    }
}
print_r($files);
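From there you can copy the matched files wherever you need them, for example (assuming an $outputDirectory of your choosing):

// Copy the matched images into a separate output directory
// ($outputDirectory is hypothetical; adjust to your setup).
$outputDirectory = '/path/to/output';
foreach ($files as $file) {
    copy($file, $outputDirectory . '/' . basename($file));
}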
Note that, contrary to the man page for pdftohtml, which states that it "generates its output in the current working directory", in my experience it always writes its output to the same directory as the PDF being read.
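If that matters for your setup, one way to make the output location predictable is to copy the PDF into a temporary directory first and run pdftohtml there, so the images land in a directory you control. A rough sketch:

// Copy the PDF into a fresh temp directory so pdftohtml writes its
// image files there, regardless of where the original lives.
$tmpDir = sys_get_temp_dir() . '/pdfimg_' . uniqid();
mkdir($tmpDir);
$tmpPdf = $tmpDir . '/source.pdf';
copy('/path/to/test.pdf', $tmpPdf);

$xml = shell_exec('pdftohtml -stdout -xml ' . escapeshellarg($tmpPdf));
// ...then parse $xml as above; the image src paths resolve under $tmpDir.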