I need to extract images from a specific section of a PDF page.
For example, consider a PDF page with a couple of images at the top and a couple of images at the bottom. I want to extract only the images at the top of the page.
What I have tried so far:
- Cropped the PDF with Ghostscript:
gs -o$croppedPdfFilepath -sDEVICE=pdfwrite -c "[/CropBox [31.46 690.22 560.54 839]" -c "/PAGES pdfmark" -sPageList=12 -f $originalPdfFilepath
- Then passed the cropped PDF to pdfimages to extract the images:
pdfimages -j "$croppedPdfFilepath" $outputDirectory/image
The problem is that pdfimages extracts all the images on that page (from both the top and the bottom), even though the cropped PDF, when viewed, only shows the images at the top of the page.
After some research, it appears that the CropBox only hides the cropped content from view; the content is still present in the PDF source.
Any guidance on removing the content from the PDF page, or any other approach, would be helpful. I'm using PHP to do this programmatically.
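Currently I drive these commands from PHP with shell_exec(), roughly like this (the variable names are placeholders for the real paths):

$originalPdfFilepath = '/path/to/original.pdf';
$croppedPdfFilepath  = '/path/to/cropped.pdf';
$outputDirectory     = '/path/to/output';

// Step 1: write a new CropBox for page 12 with Ghostscript.
shell_exec(
    'gs -o ' . escapeshellarg($croppedPdfFilepath)
    . ' -sDEVICE=pdfwrite'
    . ' -c "[/CropBox [31.46 690.22 560.54 839]" -c "/PAGES pdfmark"'
    . ' -sPageList=12 -f ' . escapeshellarg($originalPdfFilepath)
);

// Step 2: extract the images from the cropped PDF.
shell_exec(
    'pdfimages -j ' . escapeshellarg($croppedPdfFilepath)
    . ' ' . escapeshellarg($outputDirectory . '/image')
);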
CodePudding user response:
If you need to extract images based on their position on the page, you can do it fairly easily with pdftohtml by parsing its XML output and checking the position attributes of each element. Here's a very basic example that collects the full paths of images positioned less than 200 units from the top of the page:
$pdf = '/path/to/test.pdf';
$files = [];

// Run pdftohtml in XML mode, capturing the XML on stdout; the image
// files themselves are written to disk.
$xml = shell_exec('pdftohtml -stdout -xml ' . escapeshellarg($pdf));

$dom = new DOMDocument();
$dom->loadXML($xml);

// Each extracted image appears as an <image> element whose top/left/
// width/height attributes give its position on the page.
$images = $dom->getElementsByTagName('image');
foreach ($images as $image) {
    $top = (int) $image->getAttribute('top');
    if ($top < 200) {
        $files[] = dirname($pdf) . '/' . $image->getAttribute('src');
    }
}
print_r($files);
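From there you can copy the matched files wherever you need them, for example (assuming an $outputDirectory of your choosing):

// Copy the matched images into a separate output directory
// ($outputDirectory is hypothetical; adjust to your setup).
$outputDirectory = '/path/to/output';
foreach ($files as $file) {
    copy($file, $outputDirectory . '/' . basename($file));
}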
Note that, contrary to the man page for pdftohtml, which states that it "generates its output in the current working directory", in my experience it always writes its output to the same directory as the PDF being read.
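If that matters for your setup, one way to make the output location predictable is to copy the PDF into a temporary directory first and run pdftohtml there, so the images land in a directory you control. A rough sketch:

// Copy the PDF into a fresh temp directory so pdftohtml writes its
// image files there, regardless of where the original lives.
$tmpDir = sys_get_temp_dir() . '/pdfimg_' . uniqid();
mkdir($tmpDir);
$tmpPdf = $tmpDir . '/source.pdf';
copy('/path/to/test.pdf', $tmpPdf);

$xml = shell_exec('pdftohtml -stdout -xml ' . escapeshellarg($tmpPdf));
// ...then parse $xml as above; the image src paths resolve under $tmpDir.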