I want to read parquet files from an AWS S3 bucket in a for loop.
Here's my code (that doesn't work):
import boto3
import pandas as pd

session = boto3.Session(
    aws_access_key_id=key,
    aws_secret_access_key=secret,
    region_name=region_name)
s3 = session.resource('s3')
bucket = s3.Bucket(bucket_name)
for obj in bucket.objects.filter(Prefix=folder_path):
    response = obj.get()
    df = pd.read_parquet(response['Body'])
    # some data processing
It raises the following errors:
ValueError: I/O operation on closed file
and
ArrowInvalid: Called Open() on an uninitialized FileSource.
What should I fix here?
CodePudding user response:
pandas.read_parquet()
expects a reference to the file to read, not the file contents themselves, which is what you are currently passing in.
From the documentation:
path : str, path object or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.
As you can see, you can provide an S3 URL as the path, so the least intrusive change to make your loop work would probably be this:
for obj in bucket.objects.filter(Prefix=folder_path):
    obj_url = f"s3://{obj.bucket_name}/{obj.key}"
    df = pd.read_parquet(obj_url)
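Note that when pandas.read_parquet() is given an s3:// URL it goes through s3fs/fsspec (so s3fs must be installed), which by default picks up credentials from the environment rather than from your boto3 session. If you want to pass the same key and secret explicitly, pandas forwards a storage_options dict to the filesystem. A minimal sketch, reusing your existing key and secret variables:

import pandas as pd

for obj in bucket.objects.filter(Prefix=folder_path):
    obj_url = f"s3://{obj.bucket_name}/{obj.key}"
    # storage_options is handed to s3fs; "key"/"secret" are the access key id and secret access key
    df = pd.read_parquet(
        obj_url,
        storage_options={"key": key, "secret": secret},
    )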
Alternatively "How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?" lists several other solutions.
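If you would rather keep downloading the objects through your existing boto3 session, another common approach is to read the object's bytes into an in-memory buffer first, since the parquet reader needs a seekable file-like object and the raw response body is not seekable. A sketch under that assumption:

import io
import pandas as pd

for obj in bucket.objects.filter(Prefix=folder_path):
    body = obj.get()['Body'].read()          # download the object's bytes
    df = pd.read_parquet(io.BytesIO(body))   # wrap them in a seekable in-memory buffer
    # some data processing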