I should begin by saying that this is not running in Spark.
What I am attempting to do is
- stream n records from a parquet file in S3
- process them
- stream them back to a different file in S3

...but I am only asking about the first step here.
I have tried various things like:
from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

with s3.open_input_stream(filepath) as f:
    print(type(f))  # pyarrow.lib.NativeFile
    parquet_file = ParquetFile(f)
    for i in parquet_file.iter_batches():  # .read_row_groups() would be better
        # process
        ...
...but I am getting OSError: only valid on seekable files, and I am not sure how to get around it.
Apologies if this is a duplicate. I searched but didn't find quite the fit I was looking for.
CodePudding user response:
Try using open_input_file, which will "Open an input file for random access reading", instead of open_input_stream, which will "Open an input stream for sequential reading".
For context: in a Parquet file the metadata is stored at the end of the file, so the reader needs to be able to seek back and forth, which a sequential stream does not allow.
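A minimal sketch of the fix, assuming aws_key, aws_secret, and filepath are defined as in the question:

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

# open_input_file returns a seekable, random-access handle, so ParquetFile
# can jump to the footer metadata at the end of the file
with s3.open_input_file(filepath) as f:
    parquet_file = ParquetFile(f)
    for batch in parquet_file.iter_batches():
        # each batch is a pyarrow.RecordBatch; process it here
        print(batch.num_rows)

iter_batches also accepts a batch_size argument to control how many records come back per batch, and ParquetFile.read_row_group / read_row_groups are available if you would rather read whole row groups at a time.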