Read PDF in base64 format with a PDF library in Python

Time:01-13

I have a PDF encoded as a base64 string and I need to read it with a Python library. I can do that with the following steps:

  1. Decode the base64 string into the PDF's bytes
  2. Save it into a new file
  3. Read it with libraries like PyPDF2

But since I can't create a new file, I need to read it some other way. I tried using the BufferedWriter class, which is part of the io module, but I don't believe that is the right approach.

Edit 1

I can't create new files because the code will run on a serverless API host. What I need to do is take the Base64 string and read it in a way that lets me split each page into a new file and then save those files to blob storage. The split and save parts are easy; the problem is reading the Base64 string without creating a new file.

CodePudding user response:

PDF is a binary file format, not a base64 string. Base64 is a way of encoding binary data as ASCII text.
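
To make the distinction concrete, here is a minimal sketch using only the standard library. The sample bytes and the helper name looks_like_pdf are illustrative assumptions, not part of the question; the point is that the base64 text and the PDF bytes are two representations of the same data.

```python
import base64

# The first bytes of a typical PDF file (illustrative sample, not a full PDF).
raw = b"%PDF-1.4\n"

# Encoding turns the binary data into ASCII text...
encoded = base64.b64encode(raw).decode("ascii")

# ...and decoding recovers exactly the same bytes.
decoded = base64.b64decode(encoded)


def looks_like_pdf(b64_string):
    """Hypothetical helper: decode and check for the PDF magic number."""
    return base64.b64decode(b64_string).startswith(b"%PDF")
```

A quick check like looks_like_pdf can catch a wrong or truncated input before it is handed to a PDF library.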

What you need to do is decode the base64 string with base64.b64decode, which returns a bytes object, then wrap those bytes in an io.BytesIO object. BytesIO is a file-like object that lives in memory, so a PDF library like PyPDF2 can read from it without any file being created on disk:

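import base64
import io

from PyPDF2 import PdfReader

# b64_string is a placeholder for the base64-encoded PDF you received
pdf_bytes = base64.b64decode(b64_string)  # named pdf_bytes to avoid shadowing the built-in bytes
stream = io.BytesIO(pdf_bytes)            # in-memory file-like object, nothing is written to disk
reader = PdfReader(stream)
page = reader.pages[0]                    # first page of the document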