Streaming parquet files from S3 (Python)


I should begin by saying that this is not running in Spark.

What I am attempting to do is:

  • stream n records from a parquet file in S3
  • process them
  • stream the results back to a different file in S3

...but I am only asking about the first step.

Have tried various things like:

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)
with s3.open_input_stream(filepath) as f:
    print(type(f))  # pyarrow.lib.NativeFile
    parquet_file = ParquetFile(f)
    for batch in parquet_file.iter_batches():  # .read_row_groups() would be better
        pass  # process each batch here

...but I am getting OSError: only valid on seekable files, and I am not sure how to get around it.

Apologies if this is a duplicate. I searched but didn't find quite the fit I was looking for.

CodePudding user response:

Try open_input_file, documented as "Open an input file for random access reading," instead of open_input_stream, which will "Open an input stream for sequential reading."

For context, in a parquet file the metadata is stored in the footer at the end, so the reader needs to seek back and forth in the file rather than read it sequentially.
