How can I scrape PDFs within a Lambda function and save them to an S3 bucket?


I'm trying to develop a simple Lambda function that will scrape a PDF and save it to an S3 bucket, given the URL and the desired filename as input data. I keep receiving the error "Read-only file system", and I'm not sure if I have to change the bucket permissions or if there is something else I am missing. I am new to S3 and Lambda and would appreciate any help.

This is my code:

import urllib.request
import json
import boto3


def lambda_handler(event, context):
    s3 = boto3.client('s3')
    url = event['url']
    filename = event['filename'] + ".pdf"
    response = urllib.request.urlopen(url)
    file = open(filename, 'w')
    file.write(response.read())
    s3.upload_fileobj(response.read(), 'sasbreports', filename)
    file.close()

This was my event file:

{
  "url": "https://purpose-cms-preprod01.s3.amazonaws.com/wp-content/uploads/2022/03/09205150/FY21-NIKE-Impact-Report_SASB-Summary.pdf",
  "filename": "nike"
}

When I tested the function, I received this error:

{
  "errorMessage": "[Errno 30] Read-only file system: 'nike.pdf.pdf'",
  "errorType": "OSError",
  "requestId": "de0b23d3-1e62-482c-bdf8-e27e82251941",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 15, in lambda_handler\n    file = open(filename   \".pdf\", 'w')\n"
  ]
}

CodePudding user response:

AWS Lambda functions can only write to the /tmp/ directory. All other directories are Read-Only.

Also, there is a default limit of 512 MB of storage in /tmp/, so make sure you delete the files after uploading them to S3, since the Lambda execution environment can be re-used for future invocations.
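
If you do want a local copy first, the pattern looks roughly like this (a minimal sketch, assuming the same 'sasbreports' bucket and the event fields from the question):

import os
import urllib.request
import boto3


def lambda_handler(event, context):
    s3 = boto3.client('s3')
    filename = event['filename'] + ".pdf"
    # /tmp is the only writable path in a Lambda environment
    local_path = os.path.join('/tmp', filename)

    # Download the PDF to /tmp
    urllib.request.urlretrieve(event['url'], local_path)

    # Upload the local file to S3, then delete it so repeated invocations
    # in a re-used environment do not fill the 512 MB /tmp limit
    s3.upload_file(local_path, 'sasbreports', filename)
    os.remove(local_path)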

CodePudding user response:

AWS Lambda has limited space in /tmp, which is the only writable location. Writing there can be risky without proper cleanup, because the storage can persist across multiple executions of the same environment; it can fill up or unexpectedly share files with previous requests. Instead of saving the PDF locally, write it directly to S3 without touching the file system, like this:

import urllib.request
import json
import boto3


def lambda_handler(event, context):
    s3 = boto3.client('s3')
    url = event['url']
    filename = event['filename']
    response = urllib.request.urlopen(url)
    # upload_fileobj expects a file-like object; the HTTP response is one,
    # so it can be streamed to S3 without using the local file system
    s3.upload_fileobj(response, 'sasbreports', filename)

BTW: the .pdf appending was dropped here; keep or remove it according to your use case.
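
If you would rather read the body into memory first, put_object works as well (a sketch under the same assumptions about the bucket name and event fields):

import urllib.request
import boto3


def lambda_handler(event, context):
    s3 = boto3.client('s3')
    response = urllib.request.urlopen(event['url'])
    # put_object accepts bytes directly, so the whole PDF is read into memory
    # and written straight to the bucket, again without using /tmp
    s3.put_object(Bucket='sasbreports', Key=event['filename'], Body=response.read())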
