I have PDF files from which I want to extract information from the first page only. My current solution is to:
- Use PyPDF2 to read the file from S3 and save only the first page.
- Read back the one-page PDF I just saved, convert it to bytes, and analyse it with AWS Textract.
It works, but I do not like this solution. Why do I need to save the file and then read the exact same file back? Can I not use it directly in memory at runtime?
Here is what I have done that I don't like:
from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
import boto3

def analyse_first_page(bucket_name, file_name):
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, file_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs), strict=False)
    writer = PdfWriter()
    page = pdf.pages[0]
    writer.add_page(page)

    # Here is the part I do not like
    with open("first_page.pdf", "wb") as output:
        writer.write(output)
    with open("first_page.pdf", "rb") as pdf_file:
        encoded_string = bytearray(pdf_file.read())

    # Analyse text
    textract = boto3.client('textract')
    response = textract.detect_document_text(Document={"Bytes": encoded_string})
    return response

analyse_first_page(bucket, file_name)
Is there no AWS way to do this? Is there no better way to do this?
CodePudding user response:
You can use BytesIO as an in-memory stream, so you do not have to write the page to a file and then read it back.
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    # boto3 base64-encodes the Bytes payload itself, so pass the raw bytes
    encoded_string = bytes_stream.getvalue()
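Putting it together, the whole function could look roughly like the sketch below. It assumes the same PyPDF2 and boto3 calls used in the question; `writer_to_bytes` is a hypothetical helper name introduced here for illustration, not part of either library. The third-party imports sit inside the function only so that the helper can be tried on its own.

```python
from io import BytesIO


def writer_to_bytes(writer):
    """Serialize any object with a write(stream) method to bytes, in memory.

    Hypothetical helper: a PyPDF2 PdfWriter fits, since it exposes write(stream).
    """
    with BytesIO() as buffer:
        writer.write(buffer)
        # getvalue() returns the full buffer contents regardless of position
        return buffer.getvalue()


def analyse_first_page(bucket_name, file_name):
    # Imported here so writer_to_bytes stays usable without these packages
    import boto3
    from PyPDF2 import PdfReader, PdfWriter

    # Download the PDF from S3 straight into memory
    s3 = boto3.resource("s3")
    pdf_bytes = s3.Object(bucket_name, file_name).get()["Body"].read()

    # Copy only the first page into a new in-memory PDF
    reader = PdfReader(BytesIO(pdf_bytes), strict=False)
    writer = PdfWriter()
    writer.add_page(reader.pages[0])

    # Pass the raw bytes; no temporary file and no manual base64 step
    textract = boto3.client("textract")
    return textract.detect_document_text(
        Document={"Bytes": writer_to_bytes(writer)}
    )
```

Nothing touches the disk here: the one-page PDF only ever exists in the `BytesIO` buffer, and the bytes returned by `getvalue()` go straight to Textract.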