I have a folder containing a lot of sub-folders, with PDF files inside. It's a real mess to find information in these files, so I'm making a program to parse these folders and files, searching for a keyword in the PDF files, and returning the names of the PDF files containing the keyword.
And it's working. Well, almost.
I get this error when my program reaches some folders (it's hard to know which ones exactly):
PyPDF2.errors.PdfReadError: PDF starts with '♣▬', but '%PDF-' expected
As far as I can tell, all the PDF files in my folders are the same kind, so I don't understand why my program works with some files and not with others.
Thank you in advance for your responses.
CodePudding user response:
Disclaimer: I am the author of borb, the library mentioned in this answer.
PDF documents found in the wild sometimes start with non-PDF bytes (a header that is not really part of the PDF spec). This can cause all kinds of problems.
A PDF file internally keeps track of the byte offsets of all its objects (e.g. "object 10 starts at byte 10202"). Such a header makes it harder to know where an object starts:
- Do we start counting at the start of the file?
- Or at the start of where the file behaves like a PDF?
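To see which of your files carry such a leading header, a minimal sketch like the one below can help. It only uses the standard library. The function name `find_pdf_header_offset` and the 1 MB search window are assumptions made for this illustration, not part of any library.

```python
from pathlib import Path
from typing import Optional

PDF_MARKER = b"%PDF-"
SEARCH_WINDOW = 1024 * 1024  # assumption: only inspect the first 1 MB


def find_pdf_header_offset(path: Path) -> Optional[int]:
    """Return the byte offset of the first '%PDF-' marker, or None if absent."""
    with open(path, "rb") as handle:
        head = handle.read(SEARCH_WINDOW)
    offset = head.find(PDF_MARKER)
    return offset if offset >= 0 else None
```

An offset of 0 means the file starts like a regular PDF. A positive offset means there are extra bytes in front of the PDF content, and None means no marker was found in the inspected window.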
If you just want to extract text from a PDF (to be able to check it for content and keywords), you can try to use borb.
borb will look for the start of the PDF within the first 1MB of the file (thus potentially ignoring your faulty header). If this turns out to corrupt the XREF (cross reference table, containing all byte addresses of objects) it will simply build a new one.
This is an example of how to extract text from a PDF using borb:
import typing

from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction


def main():
    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(l.get_text_for_page(0))


if __name__ == "__main__":
    main()
You can find more examples in the examples repository.
CodePudding user response:
PdfFileReader accepts a strict parameter. Set it to False:
reader = PdfFileReader("example.pdf", strict=False)
If you're still getting issues, please open an issue on GitHub, but only if you can share a PDF and the code that caused the issue: https://github.com/py-pdf/PyPDF2
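Since it is hard to tell which file triggers the error, one option is to catch the exception per file and record the path instead of letting the whole run stop. The following is only a sketch: `search_pdfs` and `extract_text_with_pypdf2` are hypothetical helper names, and the extractor assumes a PyPDF2 version that provides `PdfReader` (the newer name for `PdfFileReader`) with `pages` and `extract_text()`.

```python
import os
from typing import Callable, List, Tuple


def extract_text_with_pypdf2(path: str) -> str:
    """Extract all text from one PDF in non-strict mode (assumes a recent PyPDF2)."""
    from PyPDF2 import PdfReader  # imported lazily, only needed for this extractor

    reader = PdfReader(path, strict=False)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def search_pdfs(
    root: str,
    keyword: str,
    extract_text: Callable[[str], str] = extract_text_with_pypdf2,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (paths containing the keyword, [(path, error)] for unreadable files)."""
    matches: List[str] = []
    failures: List[Tuple[str, str]] = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            try:
                text = extract_text(path)
            except Exception as exc:  # record the file instead of aborting the run
                failures.append((path, f"{type(exc).__name__}: {exc}"))
                continue
            if keyword.lower() in text.lower():
                matches.append(path)
    return matches, failures
```

Calling `matches, failures = search_pdfs("my_folder", "keyword")` then gives you both the files that contain the keyword and the exact files that could not be read, which are the ones worth attaching to a bug report.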