I need a Python program that can extract videos, audio, and images from a PDF. I have tried libraries such as PyPDF2 and Pillow, but I could not get even one of the three to work, let alone all of them.
CodePudding user response:
I think you could achieve this using pymupdf.
To extract images see the following: https://pymupdf.readthedocs.io/en/latest/recipes-images.html#how-to-extract-images-pdf-documents
Sound and video are essentially annotation types.
The "annots" method returns all the annotations of a specific type for a PDF page:
https://pymupdf.readthedocs.io/en/latest/page.html#Page.annots
The annotation types are listed here:
https://pymupdf.readthedocs.io/en/latest/vars.html#annotationtypes
Once you have acquired an annotation, I think you can use the get_file method to extract the content (see: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_file).
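As a rough sketch of the first step (finding the relevant annotations), something like the following might work. It assumes that `annot.type[1]` holds the annotation type name and that the names in `MEDIA_TYPES` are the ones of interest; both the helper name `find_media_annots` and that set are my own choices for illustration, not something from the PyMuPDF docs. It only locates the annotations and does not extract anything yet.

```python
# Annotation type names to look for (an assumed selection for illustration).
MEDIA_TYPES = {"Sound", "Movie", "Screen", "RichMedia", "FileAttachment"}


def find_media_annots(doc):
    """Return (page_number, type_name, xref) for every media-like annotation.

    `doc` is expected to be an already opened PyMuPDF document.
    """
    found = []
    for page in doc:
        for annot in page.annots():
            type_name = annot.type[1]  # annot.type is (number, name, ...)
            if type_name in MEDIA_TYPES:
                found.append((page.number, type_name, annot.xref))
    return found


# Usage (requires PyMuPDF):
#   import fitz
#   doc = fitz.open("your.pdf")
#   for page_no, kind, xref in find_media_annots(doc):
#       print(page_no, kind, xref)
```

Knowing which types actually occur in your PDF should tell you which extraction method to try next.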
Hope this helps!
CodePudding user response:
@George Davis-Diver can you please let me have an example PDF with video?
Sounds and videos are embedded in their own specific annotation types. Neither of them is a FileAttachment annotation, so the FileAttachment methods (such as get_file) cannot be used.
For a sound annotation, you must use `annot.get_sound()`, which returns a dictionary where one of the keys holds the binary sound stream.
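A minimal sketch of how that could look for all sound annotations in a document is below. It assumes the binary data sits under the `"stream"` key of the returned dictionary and that `annot.type[1]` is `"Sound"` for these annotations. The helper name `save_sounds` and the `.bin` file naming are made up for illustration; the stream is written out as-is, so you will probably still need the other dictionary entries (sample rate and so on) to turn it into a playable file.

```python
from pathlib import Path


def save_sounds(doc, out_dir):
    """Write the raw stream of every Sound annotation to out_dir.

    `doc` is expected to be an already opened PyMuPDF document.
    Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for page in doc:
        for annot in page.annots():
            if annot.type[1] != "Sound":
                continue
            sound = annot.get_sound()  # dict with metadata and the stream
            target = out / f"page{page.number}_xref{annot.xref}.bin"
            target.write_bytes(sound["stream"])  # raw data, not decoded
            saved.append(target)
    return saved


# Usage (requires PyMuPDF):
#   import fitz
#   doc = fitz.open("your.pdf")
#   print(save_sounds(doc, "sounds"))
```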
Images, on the other hand, can certainly be embedded as FileAttachment annotations, but this is unusual. Normally they are displayed on the page independently. You can list a page's images like this:
import fitz  # PyMuPDF
from pprint import pprint

doc = fitz.open("your.pdf")
page = doc[0]  # first page - page numbers are 0-based
pprint(page.get_images())
# example output:
# [(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]
# extract the image stored under xref 1114:
img = doc.extract_image(1114)
This is a dictionary with image metadata and the binary image stream. Note that PDF stores an image's transparency data separately, which therefore needs some additional care, but let us postpone this until it actually comes up.
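To go from one image to all of them, a sketch along these lines might do. It assumes the dictionary has an `"ext"` key (file extension) and an `"image"` key (the binary stream), and that the second entry of each `get_images()` tuple is the xref of the separate transparency mask (0 if there is none). The helper name `save_images` and the file naming are hypothetical. Transparency is not merged here; the mask xref is only reported so you can see which images need the additional care.

```python
from pathlib import Path


def save_images(doc, out_dir):
    """Save every image referenced by the pages of doc into out_dir.

    `doc` is expected to be an already opened PyMuPDF document.
    Returns a list of (path, smask_xref) tuples; smask_xref is 0 when
    the image has no separate transparency mask.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    seen = set()  # an image may be used on several pages
    for page in doc:
        for item in page.get_images():
            xref, smask = item[0], item[1]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)
            target = out / f"img{xref}.{info['ext']}"
            target.write_bytes(info["image"])
            saved.append((target, smask))
    return saved


# Usage (requires PyMuPDF):
#   import fitz
#   doc = fitz.open("your.pdf")
#   for path, smask in save_images(doc, "images"):
#       print(path, "has separate mask" if smask else "no mask")
```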