I do not want to write and read the same document in Python

I have PDF files and want to extract information from the first page only. My current solution is to:

Use PyPDF2 to read the file from S3 and save only the first page to disk.
Read back the one-page PDF I just saved, convert it to a byte array, and analyse it with AWS Textract.

It works, but I do not like this solution. Why should I have to save a file and then read the exact same file back? Can I not use the data directly at runtime?

Here is what I have done that I don't like:

from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
import boto3

def analyse_first_page(bucket_name, file_name):
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, file_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs), strict=False)
    writer = PdfWriter()
    page = pdf.pages[0]
    writer.add_page(page)
    
    # Here is the part I do not like
    with open("first_page.pdf", "wb") as output:
        writer.write(output)

    with open("first_page.pdf", "rb") as pdf_file:
        encoded_string = bytearray(pdf_file.read())

    # Analyse text
    textract = boto3.client('textract')
    response = textract.detect_document_text(Document={"Bytes": encoded_string})

    return response

analyse_first_page(bucket, file_name)

Is there no AWS way to do this? Is there no better way to do this?

CodePudding user response:

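You can use a BytesIO object as an in-memory stream instead of writing to a file and then reading it back. Pass the raw bytes to Textract: boto3 base64-encodes the Bytes field itself, so calling b64encode on top of that is not needed.

with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    # getvalue() returns the full buffer contents regardless of the stream position
    encoded_string = bytes_stream.getvalue()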
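
The in-memory approach can be wrapped in a small helper. This is only a minimal sketch: `writer_to_bytes` and `FakeWriter` are hypothetical names, not part of PyPDF2 or boto3. `FakeWriter` stands in for `PdfWriter` so the snippet runs without PyPDF2 or AWS credentials. The only assumption is that the writer exposes `write(stream)`, which `PdfWriter` does.

```python
from io import BytesIO


def writer_to_bytes(writer):
    """Serialise any object exposing write(stream) into bytes, without touching disk."""
    with BytesIO() as buffer:
        writer.write(buffer)
        # getvalue() returns the whole buffer, independent of the current position
        return buffer.getvalue()


class FakeWriter:
    """Hypothetical stand-in for PyPDF2.PdfWriter, used only for illustration."""

    def write(self, stream):
        stream.write(b"%PDF-1.4 fake single page")


# With the real PdfWriter from the question this would be:
#     pdf_bytes = writer_to_bytes(writer)
pdf_bytes = writer_to_bytes(FakeWriter())
print(type(pdf_bytes).__name__, len(pdf_bytes))
```

In the question's function, `encoded_string = writer_to_bytes(writer)` would replace both `with open(...)` blocks, and the result can then be passed as `Document={"Bytes": encoded_string}` just as before.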