Read parquet files from S3 bucket in a for loop


I want to read parquet files from an AWS S3 bucket in a for loop.

Here's my code (that doesn't work):

import boto3
import pandas as pd

session = boto3.Session(
    aws_access_key_id=key,
    aws_secret_access_key=secret,
    region_name=region_name,
)

s3 = session.resource('s3')
bucket = s3.Bucket(bucket_name)

for obj in bucket.objects.filter(Prefix=folder_path):
    response = obj.get()
    df = pd.read_parquet(response['Body'])
    # some data processing
It raises the following errors: ValueError: I/O operation on closed file and ArrowInvalid: Called Open() on an uninitialized FileSource.

What should I fix here?

CodePudding user response:

pandas.read_parquet() expects a reference to the file to read (a path, URL, or a seekable file-like object), not the raw streaming body you are passing in. The parquet format stores its metadata in a footer at the end of the file, so the reader needs random access, and boto3's StreamingBody is a non-seekable stream; hence the errors you see.

From the documentation:

path : str, path object or file-like object

String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.

As you can see, you can provide an S3 URL as the path, so the least intrusive change to make it work would probably be this:

for obj in bucket.objects.filter(Prefix=folder_path):
    obj_url = f"s3://{obj.bucket_name}/{obj.key}"
    df = pd.read_parquet(obj_url)
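
Note that reading s3:// URLs through pandas requires the optional s3fs package. Since your original code authenticates with an explicit key and secret, you would also need to pass those credentials along; pandas forwards the storage_options dictionary to the underlying filesystem. A minimal sketch, assuming the same key, secret and region_name variables from your question:

for obj in bucket.objects.filter(Prefix=folder_path):
    obj_url = f"s3://{obj.bucket_name}/{obj.key}"
    # storage_options is forwarded to s3fs, which handles the s3:// scheme
    df = pd.read_parquet(
        obj_url,
        storage_options={
            "key": key,
            "secret": secret,
            "client_kwargs": {"region_name": region_name},
        },
    )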

Alternatively, "How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?" lists several other solutions, one of which is sketched below.
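
For example, if you would rather keep using your existing boto3 session instead of installing s3fs, one common pattern (a sketch, not a verbatim copy of the linked answers) is to download each object into an in-memory buffer first, since pd.read_parquet can consume any seekable binary file-like object:

import io

for obj in bucket.objects.filter(Prefix=folder_path):
    # read() downloads the whole object; BytesIO makes the bytes seekable
    buffer = io.BytesIO(obj.get()['Body'].read())
    df = pd.read_parquet(buffer)

This trades memory for simplicity: each file is held fully in RAM, which is fine for typical parquet objects but worth keeping in mind for very large files.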
